Normalization in sql - sql

How would i go normalize the table that has duplicate tuples?
--------------------------------------
ID | Name | Email
-----------------------------------------
1 | John | user#somedomainname
2 | John | user#somedomainname
In this case two users have same name and email.

If it has duplicate tupes it can't have a Primary Key. This is required for first normal form.

Brax: First, look closely at the entities your table is describing. Duplicate information is a common sign that you're storing two (or more) entities in the same table. Then split these out. Write a query using group by, or distinct, or some application logic to find the unique values. Ensure this by using unique constraints where appropriate. Ensure these entities have a primary key.
Second: add foreign key columns to your existing table, so that it can form a relation to the new table(s) you just created. Fill the foreign key table.
Third: Drop the columns containing the information you just offloaded to the separate entity table(s).
Since your question is very generic, so is this answer... but I hope it helps at least a little bit.

Make the fields which form the tuple a Primary Key.

Are these users assumed to be the same person? Perhaps user 1 registered, then maybe abandoned that email address or entered it incorrectly, then user 2 used the same details). If they are assumed to be different people, you need to add additional data to uniquely identify them e.g. date of registration.
Assuming that the two users are the same person, first, decide which of the two rows you want to keep (for example, always choose the one with the lowest ID, so in this case its 1). Then find all rows in other tables that reference the row you want to remove (in this case, its 2). Update all of these Foreign Keys so they point to 1 instead. Then, you can delete the row that has ID = 2.
Now that you have tidied up the data, you should amend your schema to prevent this from happening again: put a unique constraint on Name and Email (or even just on Email if you should not allow the same email to be used by two people).

Related

How to remove redundancy when a lot of columns can have repeating data?

I am working on a library system. I have a book table in which there are 12 columns including author, title, publication etc. Among those 12 there are 2 columns book_id and isbn that are unique. When a library get n copies of a same book the only columns that are going to be unique are book_id and isbn. All other columns are going to repeat n times. Is this a bad design. Is there a way to remove this redundency? I thought of creating another table where I could store uniquely all the redudent columns and seperate book_id and isbn form it. Should I do that or is the table just fine as it is?
I think you should use two different tables.
In one table you can store unique keys like book_id, isbn_id. And in another table you can use one of these book_id or isbn_id for a reference foreign key. Or other way is you can create an auto increment id also in first table for foreign key reference to second table.
In second table you can keep all other 10 columns + 1 column for reference from 1st table.
This will help you to maintain redundancy and this first table can be also used for mapping as well if in future you want to extend database tables.
What you have described sounds like two different tables:
books which defines the attributes of a book.
copies which defines the attributes of an incoming copy.
You seem to understand the attributes of books.
The attributes of copies might include (besides a foreign key reference to books)
condition
acquired_date
source
other information about the acquisition
It might also have "disposed of" date as well. Or you could maintain that in a copies_transactions table of some sort.

Why is it a bad idea to have a table without a primary key?

I am very new to data modeling, and according to Microsoft's Entity Framework, tables without primary keys are not allowed and apparently a bad idea. I am trying to figure out why this is a bad idea, and how to fix my model so that I do not have this hole.
I have 4 tables in my current model: User, City, HelloCity, and RateCity. It is modeled as shown in the picture. The idea is that many users can visit many cities, and a user can only rate a city once, but they can greet a city many times. For this reason, I did not have a PK in HelloCity table.
Any insight as to how I can change this to comply with best practices, and why this is against best practices to begin with?
This response is mainly opinion/experience-based, so I'll list a few reasons that come to mind. Note that this is not exhaustive.
Here're some reasons why you should use primary keys (PKs):
They allow you to have a way to uniquely identify a given row in a table to ensure that there're no duplicates.
The RDBMS enforces this constraint for you, so you don't have to write additional code to check for duplicates before inserting, avoiding a full table scan, which implies better performance here.
PKs allow you to create foreign keys (FKs) to create relations between tables in a way that the RDBMS is "aware" of them. Without PKs/FKs, the relationship only exists inside the programmer's mind, and the referenced table might have a row with its "PK" deleted, and the other table with the "FK" still thinks the "PK" exists. This is bad, which leads to the next point.
It allows the RDBMS to enforce integrity constraints. Is TableA.id referenced by TableB.table_a_id? If TableB.table_a_id = 5 then, you're guaranteed to have a row with id = 5 in TableA. Data integrity and consistency is maintained, and that is good.
It allows the RDBMS to perform faster searches b/c PK fields are indexed, which means that a table doesn't need to have all of its rows checked when searching for something (e.g. a binary search on a tree structure).
In my opinion, not having a PK might be legal (i.e. the RDBMS will let you), but it's not moral (i.e. you shouldn't do it). I think you'd need to have extraordinarily good/powerful reasons to argue for not using a PK in your DB tables (and I'd still find them debatable), but based on your current level of experience (i.e. you say you're "new to data modeling"), I'd say it's not yet enough to attempt justifying a lack of PKs.
There're more reasons, but I hope this gives you enough to work through it.
As far as your M:M relations go, you need to create a new table, called an associative table, and a composite PK in it, that PK being a combination of the 2 PKs of the other 2 tables.
In other words, if there's a M:M relation between tables A and B, then we create a table C that has a 1:M relation to with both tables A and B. "Graphically", it'd look similar to:
+---+ 1 M +---+ M 1 +---+
| A |------| C |------| B |
+---+ +---+ +---+
With the C table PK somewhat like this:
+-----+
| C |
+-----+
| id | <-- C.id = A.id + B.id (i.e. combined/concatenated, not addition!)
+-----+
or like this:
+-------+
| C |
+-------+
| a_id | <--|
+-------+ +-- composite PK columns instead
| b_id | <--| of concatenation (recommended)
+-------+
Two primary reasons for a primary key:
To uniquely identify a record for later reference.
To join to other tables accurately and efficiently.
A primary key essentially tags a row with a unique identifier. This can be composed of one or more columns in a row but most commonly just uses one. Part of what makes this useful is when you have additional tables (such as the one in your scenario) you can refer to this value in other tables. Since it's unique, I can look at a column with that unique ID in another table (say HelloCity) and instantly know where to look in the User table to get more information about the person that column refers to.
For example, HelloCity only stores the IDs for the User and City. Why? Because it'd be silly to re-record ALL the data about the City and ALL the data about the User in another table when you already have it stored elsewhere. The beauty of it is, say the user needs to update their DisplayName for some reason. To do so, you just need to change it in User. Now, any row that refers to the user instantly returns the new DisplayName; otherwise you would have to find every record using the old DisplayName and update it accordingly, which in larger databases could take a considerable amount of time.
Note that the primary key is only unique in that specific table though - you could theoretically see the same primary key value in your City and User tables (this is especially common if you're using simple integers as IDs) but your database will know the difference based on the relationship you build between tables as well as your JOIN statements in your queries.
Another way primary keys help is they automatically have an index generated on their column(s). This increases performance in queries where your WHERE clause searches on the primary key column value. And, since you'll likely be referring to that primary key in other tables, it makes that lookup faster as well.
In your data model I see some columns that already have 'Id' in them. Without knowing your dataset I would hope those already have all-unique values so it should be fine to place a PK on those. If you get errors doing that there are likely duplicates.
Back to your question about HelloCity - Entity Framework is a little finicky when it comes to keys. If you really want to play it safe you can auto-generate a unique ID for every entry and call it good. This makes sense because it's a many-to-many relationship, meaning that any combination can appear any number of times, so in theory there's no reliable way to distinguish between unique entries. In the event you want to drop a single entry in the future, how would you know what row to refer to? You could make the argument that you search on all the fields and the greeting may be different, but if there are multiple visits to a city with the same greeting, you may accidentally drop all of those records instead of just one.
However, if it was a one-to-one relationship you could get away with making the combination of both CityId and UserId the primary key since that combination should always be unique (because you should never see multiple rows making that same combination).
Late to the party but I wanted to add there are special cases where a table does not need to have a primary key, or any type of key.
Take the case of a singleton, for example. A table that includes a single row (or a very well known number of rows) always. The dual table in Oracle is one case.
Formally, the primary key for a singleton is (): that is, a key with no columns. I'm not aware of any database that allows it, though.
There are other cases where a PK is not needed, typically with log tables that usually are "end tables" since you typically draw them at the border of your diagram; no other tables refer to them (i.e. they have no children). A good use of indexes is enough to deal with them since, by their nature, they don't need to enforce row uniqueness.
But, to close, yes 99.99% of tables in a relational database should have a PK.

Table needs more than one identifier

I am in no way a SQL expert so I am sure I did something wrong. I have read a few questions on here about needing a primary key. The way I created this table I can't find a way to actually have a unique key. It is a survey type database. I have a table for the main details like date, triage number, and the person involved. Another table for the questions results and another for the comments. I would have made the triage unique but more than one person can be involved so the same triage number would be used more than once. The people involved can appear more than once as well. The only truly unique thing is combining the person with the triage. I thought about an auto key but it would serve no purpose. Can using two identifiers be an acceptable practice for a survey type table?
The important part:
"... more than one person can be involved so the same triage number would be used more than once. The people involved can appear more than once as well."
Based on your comments, data in these two fields, for example:
Triage Person
------ ------
1 PersonA
1 PersonB
...
7 PersonA
7 PersonB
is fine in that Triage and Person can make a composite key, provided each person recorded in the Person field is uniquely identifiable. That is, if ea. person value is a name like "John Smith", you may have a problem if there are 2 or more John Smiths answering the survey. So, your Person value itself has to identify people uniquely. Assuming the triage nos. are distinguished (i.e., no triage no. represents more than one semantically-relevant triage position), these two fields as the composite key will work for you if and only if at no time does your survey create more than one unique triage-person combination.
The foreign key for each of your other tables ought to be the main table's composite key combination, but if the other two tables can be merged into the main one, consider it to reduce join burdens. E.g.: if the comments table stores only comments in a single field and nothing more, why not include that field in the main table and get rid of the comments table?
Your question is quite general and I don't have enough information to give you a definite answer but hopefully my comments below can help.
It is not a problem to use a composite primary key (key consisting of 2 or more columns). It is more often used in linking tables, e.g. in many-to-many relationships.
One thing that you should consider is that if you want to also refer to a table with a composite primary key from other tables, you will have to refer to 2 columns in the foreign key, all the joins, etc. It may be easier to create a separate column for a primary key (e.g. autoincrementing number).

Is ID column required in SQL?

Traditionally I have always used an ID column in SQL (mostly mysql and postgresql).
However I am wondering if it is really necessary if the rest of the columns in each row make in unique. In my latest project I have the "ID" column set as my primary key, however I never call it or use it in any way, as the data in the row makes it unique and is much more useful for me.
So, if every row in a SQL table is unique, does it need a primary key ID table, and are there ant performance changes with or without one?
Thanks!
EDIT/Additional info:
The specific example that made me ask this question is a table I am using for a many-to-many-to-many-to-many table (if we still call it that at that point) it has 4 columns (plus ID) each of which represents an ID of an external table, and each row will always be numeric and unique. only one of the columns is allowed to be null.
I understand that for normal tables an ID primary key column is a VERY good thing to have. But I get the feeling on this particular table it just wastes space and slows down adding new rows.
If you really do have some pre-existing column in your data set that already does uniquely identify your row - then no, there's no need for an extra ID column. The primary key however must be unique (in ALL circumstances) and cannot be empty (must be NOT NULL).
In my 20+ years of experience in database design, however, this is almost never truly the case. Most "natural" ID's that appear to be unique aren't - ultimately. US Social Security Numbers aren't guaranteed to be unique, and most other "natural" keys end up being almost unique - and that's just not good enough for a database system.
So if you really do have a proper, unique key in your data already - use it! But most of the time, it's easier and more convenient to have just a single surrogate ID that you can guarantee will be unique over all rows.
Don't confuse the logical model with the implementation.
The logical model shows a candidate key (all columns) which could makes your primary key.
Great. However...
In practice, having a multi column primary key has downsides: it's wide, not good when clustered etc. There is plenty of information out there and in the "related" questions list on the right
So, you'd typically
add a surrogate key (ID column)
add a unique constraint to keep the other columns unique
the ID column will be the clustered key (can be only one per table)
You can make either key the primary key now
The main exception is link or many-to-many tables that link 2 ID columns: a surrogate isn't needed (unless you have a braindead ORM)
Edit, a link: "What should I choose for my primary key?"
Edit2
For many-many tables: SQL: Do you need an auto-incremental primary key for Many-Many tables?
Yes, you could have many attributes (values) in a record (row) that you could use to make a record unique. This would be called a composite primary key.
However it will be much slower in general because the construction of the primary index will be much more expensive. The primary index is used by relational database management systems (RDBMS) not only to determine uniqueness, but also in how they order and structure records on disk.
A simple primary key of one incrementing value is generally the most performant and the easiest solution for the RDBMS to manage.
You should have one column in every table that is unique.
EDITED...
This is one of the fundamentals of database table design. It's the row identifier - the identifier identifies which row(s) are being acted upon (updated/deleted etc). Relying on column combinations that are "unique", eg (first_name, last_name, city), as your key can quickly lead to problems when two John Smiths exist, or worse when John Smith moves city and you get a collision.
In most cases, it's best to use a an artificial key that's guaranteed to be unique - like an auto increment integer. That's why they are so popular - they're needed. Commonly, the key column is simply called id, or sometimes <tablename>_id. (I prefer id)
If natural data is available that is unique and present for every row (perhaps retinal scan data for people), you can use that, but all-to-often, such data isn't available for every row.
Ideally, you should have only one unique column. That is, there should only be one key.
Using IDs to key tables means you can change the content as needed without having to repoint things
Ex. if every row points to a unique user, what would happen if he/she changed his name to let say John Blblblbe which had already been in db? And then again, what would happen if you software wants to pick up John Blblblbe's details, whose details would be picked up? the old John's or the one ho has changed his name? Well if answer for bot questions is 'nothing special gonna happen' then, yep, you don't really need "ID" column :]
Important:
Also, having a numeric ID column with numbers is much more faster when you're looking for an exact row even when the table hasn't got any indexing keys or have more than one unique
If you are sure that any other column is going to have unique data for every row and isn't going to have NULL at any time then there is no need of separate ID column to distinguish each row from others, you can make that existing column primary key for your table.
No, single-attribute keys are not essential and nor are surrogate keys. Keys should have as many attributes as are necessary for data integrity: to ensure that uniqueness is maintained, to represent accurately the universe of discourse and to allow users to identify the data of interest to them. If you have already identified a suitable key and if you don't find any real need to create another one then it would make no sense to add redundant attributes and indexes to your table.
An ID can be more meaningful, for an example an employee id can represent from which department he is, year of he join and so on. Apart from that RDBMS supports lots operations with ID's.

Multiple foreign keys to a single column

I'm defining a database for a customer/ order system where there are two highly distinct types of customers. Because they are so different having a single customer table would be very ugly (it'd be full of null columns as they are pointless for one type).
Their orders though are in the same format. Is it possible to have a CustomerId column in my Order table which has a foreign key to both the Customer Types? I have set it up in SQL server and it's given me no problems creating the relationships, but I'm yet to try inserting any data.
Also, I'm planning on using nHibernate as the ORM, could there be any problems introduced by doing the relationships like this?
No, you can't have a single field as a foreign key to two different tables. How would you tell where to look for the key?
You would at least need a field that tells what kind of user it is, or two separate foreign keys.
You could also put the information that is common for all users in one table and have separate tables for the information that is specific for the user types, so that you have a single table with user id as primary key.
A foreign key can only reference a single primary key, so no. However, you could use a bridge table:
CustomerA <---- CustomerA_Orders ----> Order
CustomerB <---- CustomerB_Orders ----> Order
So Order doesn't even have a foreign key; whether this is desirable, though...
I inherited a SQL Server database where this was done (a single column used in four foreign key relationships with four unrelated tables), so yes, it's possible. My predecessor is gone, though, so I can't ask why he thought it was a good idea.
He used a GUID column ("uniqueidentifier" type) to avoid the ambiguity problem, and he turned off constraint checking on the foreign keys, since it's guaranteed that only one will match. But I can think of lots of reasons that you shouldn't, and I haven't thought of any reasons you should.
Yours does sound like the classical "specialization" problem, typically solved by creating a parent table with the shared customer data, then two child tables that contain the data unique to each class of customer. Your foreign key would then be against the parent customer table, and your determination of which type of customer would be based on which child table had a matching entry.
You can create a foreign key referencing multiple tables. This feature is to allow vertical partioining of your table and still maintain referential integrity. In your case however, this is not applicable.
Your best bet would be to have a CustomerType table with possible columns - CustomerTypeID, CustomerID, where CustomerID is the PK and then refernce your OrderID table to CustomerID.
Raj
I know this is a very old question; however if other people are finding this question through the googles, and you don't mind adding some columns to your table, a technique I've used (using the original question as a hypothetical problem to solve) is:
Add a [CustomerType] column. The purpose of storing a value here is to indicate which table holds the PK for your (assumed) [CustomerId] FK column. Optional - addition of a check constraint (to ensure CustomerType is in CustomerA or CustomerB) will help you sleep better at night.
Add a computed column for each [CustomerType], eg:
[CustomerTypeAId] as case when [CustomerType] = 'CustomerA' then [CustomerId] end persisted
[CustomerTypeBId] as case when [CustomerType] = 'CustomerB' then [CustomerId] end persisted
Add your foreign keys to the calculated (and persisted) columns.
Caveat: I'm primarily in a MSSQL environment; so I don't know how well this translates to other DBMS (ie: Postgres, ORACLE, etc).
As noted, if the key is, say, 12345, how would you know which table to look it up in? You could, I suppose, do something to insure that the key values for the two tables never overlapped, but this is too ugly and painful to contemplate. You could have a second field that says which customer type it is. But if you're going to have two fields, why not have one field for customer type 1 id and another for customer type 2 id.
Without knowing more about your app, my first thought is that you really should have a general customer table with the data that is common to both, and then have two additional tables with the data specific to each customer type. I would think that there must be a lot of data common to the two -- basic stuff like name and address and customer number at the least -- and repeating columns across tables sucks big time. The additional tables could then refer back to the base table. As there is then a single key for the base table, the issue of foreign keys having to know which table to refer to evaporates.
Two distinct types of customer is a classic case of types and subtypes or, if you prefer, classes and subclasses. Here is an answer from another question.
Essentially, the class-table-inheritance technique is like Arnand's answer. The use of the shared-primary-key technique is what allows you to get around the problems created by two types of foreign key in one column. The foreign key will be customer-id. That will identify one row in the customer table, and also one row in the appropriate kind of customer type table, as the case may be.
Create a "customer" table include all the columns that have same data for both types of customer.
Than create table "customer_a" and "customer_b"
Use "customer_id" from "consumer" table as foreign key in "customer_a" and "customer_b"
customer
|
---------------------------------
| |
cusomter_a customer_b