Architecture of SQL tables - sql

I am wondering is it more useful and practical (size of DB) to create multiple tables in sql with two columns (one column containing foreign key and one column containing random data) or merge it and create one table containing multiple columns. I am asking this because in my scenario one product holding primary key could have sufficient/applicable data for only one column while other columns would be empty.
example a. one table
productID productname weight no_of_pages
1 book 130 500
2 watch 50 null
3 ring null null
example b. three tables
productID productname
1 book
2 watch
3 ring
productID weight
1 130
2 50
productID no_of_pages
1 500

The multi-table approach is more "normal" (in database terms) because it avoids columns that commonly store NULLs. It's also something of a pain in programming terms because you have to JOIN a bunch of tables to get your original entity back.
I suggest adopting a middle way. Weight seems to be a property of most products, if not all (indeed, a ring has a weight even if small and you'll probably want to know it for shipping purposes), so I'd leave that in the Products table. But number of pages applies only to a book, as do a slew of other unmentioned properties (author, ISBN, etc). In this example, I'd use a Products table and a Books table. The books table would extend the Products table in a fashion similar to class inheritance in object oriented program.
All book-specific properties go into the Books table, and you join only Products and Books to get a complete description of a book.

I think this all depends on how the tables will be used. Maybe your examples are oversimplifying things too much but it seems to me that the first option should be good enough.
You'd really use the second example if you're going to be doing extremely CPU intensive stuff with the first table and will only need the second and third tables when more information about a product is needed.
If you're going to need the information in the second and third tables most times you query the table, then there's no reason to join over every time and you should just keep it in one table.

I would suggest example a, in case there is a defined set of attributes for product, and an example c if you need variable number of attributes (new attributes keep coming every now and then) -
example c
productID productName
1 book
2 watch
3 ring
attrID productID attrType attrValue
1 1 weight 130
2 1 no_of_pages 500
3 2 weight 50
The table structure you have shown in example b is not normalized - there will be separate id columns required in second and third tables, since productId will be an fk and not a pk.

It depends on how many rows you are expecting on your PRODUCTS table. I would say that it would not make sense to normalize your tables to 3N in this case because product name, weight, and no_of_pages each describe the products. If you had repeating data such as manufacturers, it would make more sense to normalize your tables at that point.

Without knowing the background (data model), there is no way to tell which variant is more "correct". both are fine in certain scenarios.

You want three tables, full stop. That's best because there's no chance of watches winding up with pages (no pun intended) and some books without. If you normalize, the server works for you. If you don't, you do the work instead, just not as well. Up to you.
I am asking this because in my scenario one product holding primary key could have sufficient/applicable data for only one column while other columns would be empty.
That's always true of nullable columns. Here's the rule: a nullable column has an optional relationship to the key. A nullable column can always be, and usually should be, in a separate table where it can be non-null.

Related

Turn two database tables into one?

I am having a bit of trouble when modelling a relational database to an inventory managament system. For now, it only has 3 simple tables:
Product
ID | Name | Price
Receivings
ID | Date | Quantity | Product_ID (FK)
Sales
ID | Date | Quantity | Product_ID (FK)
As Receivings and Sales are identical, I was considering a different approach:
Product
ID | Name | Price
Receivings_Sales (the name doesn't matter)
ID | Date | Quantity | Type | Product_ID (FK)
The column type would identify if it was receiving or sale.
Can anyone help me choose the best option, pointing out the advantages and disadvantages of either approach?
The first one seems reasonable because I am thinking in a ORM way.
Thanks!
Personally I prefer the first option, that is, separate tables for Sales and Receiving.
The two biggest disadvantage in option number 2 or merging two tables into one are:
1) Inflexibility
2) Unnecessary filtering when use
First on inflexibility. If your requirements expanded (or you just simply overlooked it) then you will have to break up your schema or you will end up with unnormalized tables. For example let's say your sales would now include the Sales Clerk/Person that did the sales transaction so obviously it has nothing to do with 'Receiving'. And what if you do Retail or Wholesale sales how would you accommodate that in your merged tables? How about discounts or promos? Now, I am identifying the obvious here. Now, let's go to Receiving. What if we want to tie up our receiving to our Purchase Order? Obviously, purchase order details like P.O. Number, P.O. Date, Supplier Name etc would not be under Sales but obviously related more to Receiving.
Secondly, on unnecessary filtering when use. If you have merged tables and you want only to use the Sales (or Receving) portion of the table then you have to filter out the Receiving portion either by your back-end or your front-end program. Whereas if it a separate table you have just to deal with one table at a time.
Additionally, you mentioned ORM, the first option would best fit to that endeavour because obviously an object or entity for that matter should be distinct from other entity/object.
If the tables really are and always will be identical (and I have my doubts), then name the unified table something more generic, like "InventoryTransaction", and then use negative numbers for one of the transaction types: probably sales, since that would correctly mark your inventory in terms of keeping track of stock on hand.
The fact that headings are the same is irrelevant. Seeking to use a single table because headings are the same is misconceived.
-- person [source] loves person [target]
LOVES(source,target)
-- person [source] hates person [target]
HATES(source,target)
Every base table has a corresponding predicate aka fill-in-the-[named-]blanks statement describing the application situation. A base table holds the rows that make a true statement.
Every query expression combines base table names via JOIN, UNION, SELECT, EXCEPT, WHERE condition, etc and has a corresponding predicate that combines base table predicates via (respectively) AND, OR, EXISTS, AND NOT, AND condition, etc. A query result holds the rows that make a true statement.
Such a set of predicate-satisfying rows is a relation. There is no other reason to put rows in a table.
(The other answers here address, as they must, proposals for and consequences of the predicate that your one table could have. But if you didn't propose the table because of its predicate, why did you propose it at all? The answer is, since not for the predicate, for no good reason.)

Need advice on SQL philosophy

Before I go asking more questions about the coding, I'd like to first figure out the best method for me to follow for making my database. I'm running into a problem with how I should go about structuring it to keep everything minimized and due to its' nature I have lots of re-occurring data that I have to represent.
I design custom shirts and have a variety of different types of shirts for people to choose from that are available in both adult and child sizes of both genders. For example, I have crewneck shirts, raglan sleeves, ringer sleeves and hoodies which are available for men, women, boys, girls and toddlers. The prices are the same for each shirt from the toddler sizes up to 1x in the adult sizes, then 2x, 3x, 4x and 5x are each different prices. Then there's the color options for each kind of shirt which varies, some may have 4 color options, some have 32.
So lets take just the crewneck shirts for an example. Men s-1x, Women s-1x, Boys xs-1x, girls xs-1x and toddlers NB-18months is a total of 22 rows that will be represented in the table and are all the same price. 2X and up only apply to men and women so that's 8 more rows which makes 30 rows total for just the crewneck shirts. When it gets into the color options, there's 32 different colors available for them. If I were to do each and every size for all of them that would be 960 total rows just for the crewneck shirts alone with mainly HIGHLY repeated data for just one minor change.
I thought about it and figured It's best to treat these items on the table as actual items in a stock room because THEY'RE REALLY THERE in the stock room... you don't have just one box of shirts that you can punch a button on the side to turn to any size of color, you have to deal with the actual shirt and tedious task of putting them somewhere, so I deciding against trying to get outrageous with a bunch of foreign keys and indexes, besides that it gets just as tedious and you wind up having to represent just as much but with a lot more tables when you could've just put the data it's linking to in the first table.
If we take just the other 3 kinds of shirts and apply that same logic with all the colors and sizes just for those 4 shirts alone there will be 3,840 rows, with the other shirts left I'm not counting in you could say I'm looking at roughly 10,000 rows of data all in one table. This data will be growing over time and I'm wondering what it might turn into trying to keep it all organized. So I figured maybe the best logic to go with would be to break it down like the do in an actual retail store, which is to separate the departments into men, women, boys, girls and babies. so that way I have 5 separate tables that are only called when the user decides to "go to that department" so if there's a man who wants the men shirts he doesn't have 7,000+ rows of extra data present that doesn't even apply to what he's looking for.
Would this be a better way of setting it up? or would it be better to keep it all as one gigantic table and just query the "men" shirts in the php from the table in the section for men and the same with women and kids?
My next issue is all the color options that may be available, as I said before some shirts will have as few as 4 some will have as many as 32, so some of those are enough data to form a table all on their own, so I could really have a separate table for every kind of shirt. I'll be using a query in php to populate my items from the tables so I don't have to code so much in the html and javascript. That way I can set it to SELECT ALL * table WHERE type=men and it will take all the men shirts and auto populate the coding for each one. That way as I add and take things to and from the tables it'll automatically be updated. I already have an idea for HOW I'm going to do that, but I can only think so far into it because I haven't decided on a good way to set the tables up which is what I'd have to structure it to call from.
For example, if I have all the color options of each shirt all on the same table versus having it broken down and foreign keys linking to other tables to represent them. that would be two totally different ways of having to call it forth, so I'm stuck on this and don't really know where to go with it. any suggestions?
Typically retail organization is by SKU (stock keeping unit). Department and color are attributes of a garment, not the way you identify the garment for the purpose of accounting or stocking.
CREATE TABLE Skus (
sku BIGINT UNSIGNED PRIMARY KEY,
description TEXT,
department VARCHAR(10) NOT NULL,
color VARCHAR(10) NOT NULL,
qty_in_stock INT UNSIGNED NOT NULL DEFAULT 0,
unit_price NUMERIC(9,2) NOT NULL,
FOREIGN KEY (department) REFERENCES Departments(department),
FOREIGN KEY (color) REFERENCES Colors(color)
);
This is better than splitting into five tables, because:
You can quickly get a sum of the total value of all your stock.
You can switch the department of a given SKU easily.
When someone buys a few garments, their order lineitems reference a single table instead of five different tables (that would be invalid for a relational database).
There are lots of other examples of tasks that are easier if similar entities are stored in one table.
I know you don't want to break it out into separate tables, but I think going the multiple table route would be the best. However, I don't think it is as bad as you think. My suggestion would be the following. Obviously, you want to change the names of the fields, but this is a quick representation:
Shirts
- id (primary key)
- description
- men (Y/N)
- women (Y/N)
- boy (Y/N)
- girl (Y/N)
- toddlers (Y/N)
Sizes
- id (primary key)
- shirt_id (foreign key)
- Size
Colors
- id (primary key)
- shirt_id (foreign key)
- Color
Price
- id (primary key)
- shirt_id (foreign key)
- size_id (foreign key)
- price
Having these three tables makes it so that you won't have to store all 10,000 rows in one single table and maintain it, but the data is still all there. Keeping your data separated into their proper places keeps from replicating needless information.
Want to pull all men's shirts?
SELECT * FROM shirts WHERE men = '1'
To be honest, you should really have at least 5 or 6 tables. One/two containing the labels for sizes and colors (either one table containing all, or one for each one) and the other 4 containing the actual data. This will keep your data uniform across everything (example: Blue vs blue). You know what they say, there is more than one way to skin a cat.
You need to think about a database term called 'normalization'. Normalization means that everything has it's place in the database and should not be listed twice but reused as needed. The most common mistake people make is to not ask or think about what will happen down the road and they put up a database that has next to no normalization, has massive memory consumed do to large datatypes, no seeding done, and is completely inflexible and comes at a great cost to change later because it was made without thinking of the future.
There are many levels of normalization but the most consistent thing is to think about a simple example I could give you to explain some simple concepts that can be applied to larger things later. This is assuming you have access to SQL management studio, SSMS, HOWEVER if you are using MYSQL or Oracle the principles are still very similar and the comments sections will show what I am getting at. This example you can self run if you have SSMS and just paste it in and hit F5. If you don't just look at the comments section although these concepts are better to see in action than to try to just envision what they mean.
Declare #Everything table (PersonID int, OrderID int, PersonName varchar(8), OrderName varchar(8) );
insert into #Everything values (1, 1, 'Brett', 'Hat'),(1, 2, 'Brett', 'Shirt'),(1, 3, 'Brett', 'Shoes'),(2,1,'John','Shirt'),(2,2,'John','Shoes');
-- very basic normalization level in that I did not even ATTEMPT to seperate entities into different tables for reuse.
-- I just insert EVERYTHING as I get in one place. This is great for just getting off the ground or testing things.
-- but in the future you won't be able to change this easily as everything is here and if there is a lot of data it is hard
-- to move it. When you insert if you keep adding more and more and more columns it will get slower as it requires memory
-- for the rows and the columns
Select Top 10 * from #Everything
declare #Person table ( PersonID int identity, PersonName varchar(8));
insert into #Person values ('Brett'),('John');
declare #Orders table ( OrderID int identity, PersonID int, OrderName varchar(8));
insert into #Orders values (1, 'Hat'),(1,'Shirt'),(1, 'Shoes'),(2,'Shirt'),(2, 'Shoes');
-- I now have tables storing two logic things in two logical places. If I want to relate them I can use the TSQL language
-- to do so. I am now using less memory for storage of the individual tables and if one or another becomes too large I can
-- deal with them isolated. I also have a seeding record (an ever increasing number) that I could use as a primary key to
-- relate row position and for faster indexing
Select *
from #Person p
join #Orders o on p.PersonID = o.PersonID
declare #TypeOfOrder table ( OrderTypeID int identity, OrderType varchar(8));
insert into #TypeOfOrder values ('Hat'),('Shirt'),('Shoes')
declare #OrderBridge table ( OrderID int identity, PersonID int, OrderType int)
insert into #OrderBridge values (1, 1),(1,2),(1,3),(2,2),(2,3);
-- Wow I have a lot more columns but my ability to expand is now pretty flexible I could add even MORE products to the bridge table
-- or other tables I have not even thought of yet. Now that I have a bridge table I have to list a product type ONLY once ever and
-- then when someone orders it again I just label the bridge to relate a person to an order, hence the name bridge as it on it's own
-- serves nothing but relating two different things to each other. This method takes more time to set up but in the end you need
-- less rows of your database overall as you are REUSING data efficiently and effectively.
Select Top 10 *
from #Person p
join #OrderBridge o on p.PersonID = o.PersonID
join #TypeOfOrder t on o.OrderType = t.OrderTypeID

I keep messing up 1NF

For me the most understandable description of going about 1NF so far I found is ‘A primary key is a column (or group of columns) that uniquely identifies each row. ‘ on www.phlonx.com
I understand that redundancy means per key there shouldn’t be more than 1 value to each row. More than 1 value would then be ‘redundant’. Right?
Still I manage to screw up 1 NF a lot of times.
I posted a question for my online pizzashop http://foo.com
pizzashop
here
where I was confused about something in the second normal form only to notice I started off wrong in 1 NF.
Right now I’m thinking that I need 3 keys in 1NF in order to uniquely identify each row.
In this case, I’m finding that order_id, pizza_id, and topping_id will do that for me. So that’s 3 columns. Because if you want to know which particular pizza is which you need to know what order_id it has what type of pizza (pizza_id) and what topping is on there. If you know that, you can look up all the rest.
Yet, from an answer to previous question this seems to be wrong, because topping_id goes to a different table which I don’t understand.
Here’s the list of columns:
Order_id
Order_date
Customer_id
Customer_name
Phone
Promotion
Blacklist Y or N
Customer_address
ZIP_code
City
E_mail
Pizza_id
Pizza_name
Size
Pizza_price
Amount
Topping_id
Topping_name
Topping_prijs
Availabitly
Delivery_id
Delivery_zone
Deliveryguy_id
Deliveryguy_name
Delivery Y or N
Edit: I marked the id's for the first concatenated key in bold. They are only a list of columns, unnormalized. They're not 1 table or 3 tables or anything
use Object Role Modelling (say with NORMA) to capture your information about the design, press the button and it spits out SQL.
This will be easier than having you going back and forth between 1NF, 2NF etc. An ORM design is guaranteed to be in 5NF.
Some notes:
you can have composite keys
surrogate keys may be added after both conceptual and logical design: you have added them up front which is bad. They are added because of the RDBMS performance, not at design time
have you read several sources on 1NF?
start with plain english and some facts. Which is what ORM does with verbalisation.
So:
A Customer has many pizzas (zero to n)
A pizza has many toppings (zero to n)
A customer has an address
A pizza has a base
...
I'd use some more tables for this, to remove duplication for customers, orders, toppings and pizze:
Table: Customer
Customer_id
Customer_name
Customer_name
Phone
Promotion
Blacklist Y or N
Customer_address
ZIP_code
City
E_mail
Table: Order
Order_id
Order_date
Customer_id
Delivery_zone
Deliveryguy_id
Deliveryguy_name
Delivery Y or N
Table: Order_Details
Order_ID (FK on Order)
Pizza_ID (FK on Pizza)
Amount
Table: Pizza
Pizza_id
Pizza_name
Size
Pizza_price
Table: Topping
Topping_id
Topping_name
Topping_prijs
Availabitly
Table: Pizza_Topping
Pizza_ID
Topping_ID
Pizza_topping and Order_details are so-called interselection tables ("helper" tables for modelling a m:n relationship between two tables).
Now suppose we have just one pizza, some toppings and our customer Billy Smith orders 2 quattro stagione pizze - our tables will contain this content:
Pizza(Pizza_ID, Pizza_name, Pizza_price)
1 Quattro stagioni 12€
Topping(Topping_id, topping_name, topping_price)
1 Mozzarrella 0,50€
2 Prosciutto 0,70€
3 Salami 0,50€
Pizza_Topping(Pizza_ID, Topping_ID)
1 1
1 3
(here, a quattro stagioni pizza contains only Mozzarrella and Salami).
Order(order_ID, Customer_name - rest omitted)
1 Billy Smith
Order_Details(order_id, Pizza_id, amount)
1 1 2
I've removed delivery ID, since for me, there is no distinction between an Order and a delivery - or do you support partial deliveries?
On 1NF, from wikipedia, quoting Date:
According to Date's definition of 1NF,
a table is in 1NF if and only if it is
"isomorphic to some relation", which
means, specifically, that it satisfies
the following five conditions:
There's no top-to-bottom ordering to the rows.
There's no left-to-right ordering to the columns.
There are no duplicate rows.
Every row-and-column intersection contains exactly one
value from the applicable domain (and
nothing else).
All columns are regular [i.e. rows have no hidden components such as
row IDs, object IDs, or hidden
timestamps].
—Chris Date, "What First Normal Form Really Means", pp. 127–8[4]
First two are guaranteed in any modern RDBMS.
Duplicate rows are possible in modern RDBMS - however, only if you don't have primary keys (or other unique constraints).
The fourth one is the hardest one (and depends on the semantics of your model) - for example your field Customer_address might be breaking 1NF. Might be, because if you make a contract with yourself (and any potential user of the system) that you will always look at the address as a whole and will not want to separate street name, street number and or floor, you could still claim that 1NF is not broken.
It would be more proper to break the customer address, but there are complexities there with which you would then need to address and which might bring no benefit (provided that you will never have to look a the sub-atomic part of the address line).
The fifth one is broken by some modern RDBMs, however the real importance is that your model nor system should depend on hidden elements, which is normally true - even if your RDBMS uses OIDs internally for certain operations, unless you start to use them for non-administrative, non-maintenance tasks, you can consider it not breaking the 1NF.
The strengths of relational databases come from separating information into different tables. One useful way of looking at tables is first to identify as entity tables those concepts which are relatively permanent (in your case, probably Pizza, Customer, Topping, Deliveryguy). Then you think about the relations between them (in your case, Order, Delivery ). The relational tables link together the entity tables by having foreign keys pointing to the relevant entities: an Order has foreign keys to Customer, Pizza, Topping); a Delivery has foreign keys to Deliveryguy and Order. And, yes, relations can link relations, not just entities.
Only in such a context can you achieve anything like normalization. Tossing a bunch of attributes into one singular table does not make your database relational in any meaningful sense.

Basic question: how to properly redesign this schema

I am hopping on a project that sits on top of a Sql Server 2008 DB with what seems like an inefficient schema to me. However, I'm not an expert at anything SQL, so I am seeking for guidance.
In general, the schema has tables like this:
ID | A | B
ID is a unique identifier
A contains text, such as animal names. There's very little variety; maybe 3-4 different values in thousands of rows. This could vary with time, but still a small set.
B is one of two options, but stored as text. The set is finite.
My questions are as follows:
Should I create another table for names contained in A, with an ID and a value, and set the ID as the primary key? Or should I just put an index on that column in my table? Right now, to get a list of A's, it does "select distinct(a) from table" which seems inefficient to me.
The table has a multitude of columns for properties of A. It could be like: Color, Age, Weight, etc. I would think that this is better suited in a separate table with: ID, AnimalID, Property, Value. Each property is unique to the animal, so I'm not sure how this schema could enforce this (the current schema implies this as it's a column, so you can only have one value for each property).
Right now the DB is easily readable by a human, but its size is growing fast and I feel like the design is inefficient. There currently is not index at all anywhere. As I said I'm not a pro, but will read more on the subject. The goal is to have a fast system. Thanks for your advice!
This sounds like a database that might represent a veterinary clinic.
If the table you describe represents the various patients (animals) that come to the clinic, then having properties specific to them are probably best on the primary table. But, as you say column "A" contains a species name, it might be worthwhile to link that to a secondary table to save on the redundancy of storing those names:
For example:
Patients
--------
ID Name SpeciesID Color DOB Weight
1 Spot 1 Black/White 2008-01-01 20
Species
-------
ID Species
1 Cocker Spaniel
If your main table should be instead grouped by customer or owner, then you may want to add an Animals table and link it:
Customers
---------
ID Name
1 John Q. Sample
Animals
-------
ID CustomerID SpeciesID Name Color DOB Weight
1 1 1 Spot Black/White 2008-01-01 20
...
As for your original column B, consider converting it to a boolean (BIT) if you only need to store two states. Barring that, consider CHAR to store a fixed number of characters.
Like most things, it depends.
By having the animal names directly in the table, it makes your reporting queries more efficient by removing the need for many joins.
Going with something like 3rd normal form (having an ID/Name table for the animals) makes you database smaller, but requires more joins for reporting.
Either way, make sure to add some indexes.

what is the best database design for this table when you have two types of records

i am tracking exercises. i have a workout table with
id
exercise_id (foreign key into exercise table)
now, some exercises like weight training would have the fields:
weight, reps (i just lifted 10 times # 100 lbs.)
and other exercises like running would have the fields: time, distance (i just ran 5 miles and it took 1 hours)
should i store these all in the same table and just have some records have 2 fields filled in and the other fields blank or should this be broken down into multiple tables.
at the end of the day, i want to query for all exercises in a day (which will include both types of exercises) so i will have to have some "switch" somewhere to differentiate the different types of exercises
what is the best database design for this situation
There are a few different patterns for modelling object oriented inheritance in database tables. The most simple being Single table inheritance, which will probably work great in this case.
Implementing it is mostly according to your own suggestion to have some fields filled in and the others blank.
One way to do it is to have an "exercise" table with a "type" field that names another table where the exercise-specific details are, and a foreign key into that table.
if you plan on keeping it only 2 types, just have exercise_id, value1, value2, type
you can filter the type of exercise in the where clause and alias the column names in the same statment so that the results don't say value1 and value2, but weight and reps or time and distance