Database Design related to Categories and Subcategories - sql

I have researched this to no end. I am not the only person who has asked this question... but I would like your thoughts regarding the best practice.
I'm trying to design a Database that will track financial transactions. For the sake of simplicity, each transaction can only have one Category, and each category can only have one Sub-category.
I have a self-referencing table, like this:
Table: Categories
ID, int, primary key
parentID, int, foreign key
description, text
Long story short, you end up with data like this:
1 Auto [null]
2 Bills [null]
3 Healthcare [null]
4 Maintenance 1
5 Gasoline 1
6 Cell Phone 2
7 Rent 2
8 Prescriptions 3
9 Dentist 3
So far, so good. Here is my problem:
I don't know the proper way I'm supposed to relate this all back to my Transactions table. 'Transactions' has a column for 'Category' and 'Subcategory'. Transaction.ID would be the PK, and Categories.ID would be the FK.
With Transactions related to Categories in the manner specified above, that means any value from Categories could be written to Category or Subcategory...
Is it my responsibility as the programmer to control access to the table via a form? In other words, is my only option 'programmatically controlling' what goes into the Category and Subcategory columns?
Remember, each Category can only have one Subcategory. The selected Category should only allow that Category's children...
Am I making sense?
GOOD: Auto -- Maintenance
BAD: Healthcare -- Gasoline

The case you pose is subset of the more general problem of encoding hierarchical data, tree structures, in relational tables. This case has been studied in great detail ever since relational databases first made the scene in the late 1970s.
In bookkeeping systems in particular, the idea of subcategories and categories comes up, every single time. Larger scale industrial systems tend to have a four level system, with overall account type (Expenses), Category (Transportation), Subcategory (Automotive), and sub-subcategory (Gasoline).
Your research might be more productive if you used the following search terms: "Tree structure in relational design". That search yielded the following Wikipedia summary:
http://en.wikipedia.org/wiki/Hierarchical_database_model
You can find lots of related questions and answers here in SO. Search under "nested sets" or "adjancency lists" for a couple of techniques.
Your problem is going to be to simplify the answers you will find down to the case where there are only two levels: category and subcategory.
I think whatever design you choose will want to make the following rule explicit: Subcategory determines category. and you will, IMO, want the DBMS to enforce this rule so that no transaction ends up with a subcategory that is inconsistent with its category.

So your categorizations are not orthogonal and independent (such as gender and city), but rather hierarchical (such as State and County).
In order to enforce a hierarchical categorizations, use a single categorization table with a ID column as primary key, referenced as a foreign key in the data table, and two descriptive fields Category and Subcategory.
To facilitate data entry you might supply a combo box Category which filters the available subcategories. However the actual foreign key reference is supplied by the selection made in the Subcategory combo box, which must list both fields, Category and Subcategory. It would be usual to concatenate these two fields with a delimiter such as dash (-) or pipe(|).

Related

Why need to add ID field to products categories of of online shopping database?

I am just started to learn about relational database. When I studied the database of online shopping websites, I found that many examples create a category table and added ID field to the category name. I don't know why they need to create a category table and use category ID as a foreign key to relate products table. What will happen if I remove the category table and add the category name directly to the products table?
What I think is a lot of cases that you want a website menu showing your categories. This menu allows people to view your categories (Men Clothing, Women Clothing, Kids, Accessories) and once they click it they can see the products relevant to them.
If you put the category name to the product, it is very hard for you to update your menu content as you need to loop, group the category in the product table. Also, it is harder to update the category name in product table as a category name could be in lots of product records,
Whereas if you have a category table, you just need to maintain the category table (view what you have in the category table and update DB record if you want your menu change).
In long term maintenance, category table is desired.
In a case I have come over that I would like an empty category which just to show in the website menu (a menu item which contains no product) which is not possible if I do not have a category table.
By inserting just the category name you may complete your POC but you need to understand what is Normalization and why it is needed.
First normal form (1NF) : An entity type is in 1NF when it contains no repeating groups of data.
Second normal form (2NF) : An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key.
Third normal form (3NF) : An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.
Source
What will happen if I remove the category table and add the category name directly to the products table?
Suppose you store the category with each product, and one day your boss tells you that you misspelled a category name. Which one?
"Theater" he says. Or did he say "theatre?" Which is correct? You check and find about "theater" and "theatre" are used close to evenly among the products that have either one.
So which spelling did your boss mean is the mistake, and which one is correct?
If you store the correct spelling in one place, in its own categories table, then you can be sure. You can correct it, and all the products that reference it will implicitly get the correction.
That's an argument for normalization, but keep in mind using an integer id is only a convention. It has nothing to do with normalization. You can use a string as a primary key of a table, and therefore you can use a string as a foreign key in a table that references it.
It's okay to use a non-integer for key columns. As long as there is one instance that stores the canonical value, it satisfies the goal of normalization -- that is to reduce data anomalies.

SQL schema many to many relationships

I am working on a school project and am running into some trouble with my understanding of building SQL schema ER diagrams.
I have 4 entities; a customer table in a one to many relationship with my order table, an order table with a one to many relationship with my line item table, a line item table that is currently in a one to many relationship with a burger table.
The problem I'm running into is that my line item table has a composite primary key of order id, burger id and I believe that line item->burger should be in a many to many relationship because one or more line items can be associated with one or more burger and vice-a-versa but my understanding is that you do not want any many to many relationships in a normalized tabled. Are my assumptions that it's supposed to be many to many correct and why am I allowed to have a many to many relationship?
For reference my table is currently in 3NF form
Ignoring modality and assuming it's always a minimum of one, if my Burger entity has a one to many relationship with line item then I'll have the following
Which means that one burger has one or more line items associated with it, however my thought process tells me that this isn't the only case, and in fact one line item can have more than one burger but from everything im reading this is not the case. The composite keys have me really confused, and I do see why it may be in a burger->line item one->many due to the burger count field which pertains to only a single burger type (it holds the value of how many burgers are in the line item).
What I believe it should look like:
Edit:
Please ignore the customer and order tables, they were only included to provide a full view and I did notice that I forgot to include foreign keys just now.
I believe that line item->burger should be in a many to many
Why would a single line item be related to multiple burgers? It wouldn't. An order has many line items. Each line item refers to a single burger. The top diagram is correct.
Note that normally you sell things other than burgers, and you have line items that don't refer to a product, like taxes, discounts, etc. So often the PK of the items table is just (OrderId,OrderItemId), and the ProductID is a non-key attribute.

Numeric IDs vs. String IDs

I'm using a very stripped down example here so please ask if you need more context.
I'm in the process of restructuring/normalising a database where the ID fields in the majority of the tables have primary key fields which are auto-incremented numerical ID's (1,2,3 etc.) and I'm thinking I need to change the ID field from a numerical value to a string value generated from data in the row.
My reasoning for this is as follows:
I have 5 tables; Staff, Members, Volunteers, Interns and Students; all of these have numeric ID's.
I have another table called BuildingAttendance which logs when people visited the premises and for what reason which has the following relevant fields:
ID Type Premises Attended
To differentiate between staff and members. I use the type field, using MEM for member and STA for staff, etc. So as an example:
ID Type Premises Attended
1 MEM Building A 27/6/15
1 STA Building A 27/6/15
2 STU Building B 27/6/15
I'm thinking it might be a better design design to use an ID similar to the following:
ID Premises Attended
MEM1 Building A 27/6/15
STA1 Building A 27/6/15
STU2 Building B 27/6/15
What would be the best way to deal with this? I know that if my primary key is a string my query performance may take a hit, but is this easier than having 2 columns?
tl;dr - How should I deal a table that references records from other tables with the same ID system?
Auto-incremented numeric ids have several advantages over strings:
They are easier to implement. In order to generate the strings (as you want them), you would need to implement a trigger or computed column.
They occupy a fixed amount of storage (probably 4 bytes), so they are more efficient in the data record and in indexes.
They allow members to change between types, without affecting the key.
The problem that you are facing is that you have subtypes of a supertype. This information should be stored with the person, not in the attendance record (unless a person could change their type with each visit). There are several ways to approach this in SQL, none as clean as simple class inheritance in a programming language.
One technique is to put all the data in a single table called something like Persons. This would have a unique id, a type, and all the columns from your five tables. The problem is when the columns from your subtables are very different.
In that case, have a table called persons with a unique primary key and the common columns. Then have separate tables for each one and use the PersonId as the primary key for these tables.
The advantage to this approach is that you can have a foreign key reference to Persons for something like BuildingAttendance. And, you can also have foreign key references to each of the subtypes, for other tables where appropriate.
Gordon Linoff already provided an answer that points out the type/supertype issue. I refer to this a class/subclass, but that's just a difference in terminology.
There are two tags in this area that collect questions that relate to class/subclass. Here they are:
class-table-inheritance
shared-primary-key
If you will look over the info tab for each of these tags, you'll see a brief outline. Plus the answers to the questions will help you with your case.
By creating a single table called Person, with an autonumber ID, you provide a handy way of referencing a person, regardless of that person's type. By making the staff, member, volunteer, student, and intern tables use a copy of this ID as their own ID you will facilitate whatever joins you need to perform.
The decision about whether to include type in attendance depends on whether you want to retrieve the data with the person's current type, or with the type the person had at the time of the attendance.

Structuring Categories in SQL Server 2008 R2

Hello I looked at a few similar posts to what I am looking to do but none are the same to what I need to accomplish. I am trying to come up with my structure for categories using SQL Server 2008 R2.
I want to make categories for lets say...Clothing, Electronics, Furniture, Tools......and so on.
I am looking at a 3 field table to start with a category table (category ID (PK), categoryname, parentID) which from what I am finding is a standard practice and can go several layers deep without having to restructure.
The problem lies where it is fine for lets say (electronics-cd players-cd changer), (electronics-lighting-studio lighting) or (clothing-womens-skirts), (clothing-womens-pants) perhaps one level deeper?
What do I do for brands? I was planning to have a brand table (brandID(PK),Brand)
then Category_Brand table (categoryID, BrandID) to link brands to categories when I want to use a cascading dropdown list that populates from the database.
What do I do for deeper attributes where the rest of the attributes apply to the item itself, but are dependent on the category? color, pattern, material, size? which can apply to clothing, but not to electronics or tools, also Mens clothing has different sizing than womens clothing.
Or furniture where I want to store dresser dimensions and color, or beds where I want to store bed size (king, queen, twin) and to store the type (Spring, air, foam, water)
What i need is to connect the item specific attributes to each item based on which category the item belongs to. On another forum I was suggested to just add all the misc. attributes to the item table and leave the ones I don't use null. I know that doesn't make sense, it seems to me that there should be different sub-attribute tables with fields that are related to the categories that they represent. i am thinking that clothing size for example would have a lookup table where each size has a (sizeid) and a link table for a many to many type relationship to connect the size with the (itemid), although there would need to be a few different size tables because men's sizes and women's sizes are different or put then all in one table with the (categoryid) as a sort of parent foreign key, and dimensions for another item like (length, width, height) would be stored into its own table along with the (itemid) as the foreign key?
Or is it a good idea to store the (sizeid) or (dimensionid) right into the item table?
This seemed to be simple to me when I started, but the more I look at it the more I am getting confused as to the correct way to structure this, I want it to work good for performance as this may become a high volume application. But doesn't everyone wish that?
try to understand normalization first. Here is a good article for you.

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone has a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would have been something like that:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country which can be as long as "Bosnia & Herzegovina" for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now in the future, as often happens with Balkan countries, they split to two countries: Bosnia and Herzegovina, I will have to change it only in one place. well, sort of.
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person CountryVisited AnotherInformation D.O.B.
Faruz USA Blah Blah 1/1/2000
Faruz Canada Blah Blah 1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The belltower" is repeated twice, which is redundant, and makes it difficult to query and change data and (the worst) possible to introduce logical inconsistencies.
Eg. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g. if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables ?
The answer is ofcourse normalization. As already said, its to avoid redundancy and there by update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to
go shopping for ice cream without
normalization, you would then have
another note, saying you have to go
shopping for ice cream, just one in
each pocket.
Now, In real life, you would never do
this, so why do it in a database?
For the designing and implementing part, thats when you can move back to "the lingo" and keep it away from layman terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that tables primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules that used to design tables that connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, increasing speed of queries.
First Normal Form: Data should be broken up in the smallest units. Tables should not contain repetitive groups of columns. Each row is identified with one or more primary key.
For example, There is a column named 'Name' in a 'Custom' table, it should be broken to 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustiomID' to identify a particular custom.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Custom' table has a column named 'City', the city should has a separate table with primary key and city name defined, in the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' the foreign key in the tale.
Third normal form: Each non-key column should not depend on other non-key columns.
For example, In an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until that point in time that you will actually use the field. So even if it is blank, then it will still be using up space and explain to them that it also slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype in values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. We instead, will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will setup a drop-down list that will give you a list of customer, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, whereas 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarfication/ breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in table that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When in a truly normalized database, I have found that in situations it's been easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design on the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out a few tables, without having to join to a ton of tables to get one data set.