Need advice on SQL philosophy

Before I go asking more questions about the coding, I'd like to figure out the best method to follow for structuring my database. I'm running into a problem with how to organize it to keep everything minimal, because due to its nature I have lots of recurring data to represent.
I design custom shirts and have a variety of shirt types for people to choose from, available in both adult and child sizes for both genders. For example, I have crewneck shirts, raglan sleeves, ringer sleeves, and hoodies, which are available for men, women, boys, girls, and toddlers. The prices are the same for each shirt from the toddler sizes up to 1X in the adult sizes; then 2X, 3X, 4X, and 5X are each priced differently. Then there are the color options for each kind of shirt, which vary: some have 4 color options, some have 32.
So let's take just the crewneck shirts as an example. Men S-1X, women S-1X, boys XS-1X, girls XS-1X, and toddlers NB-18 months make a total of 22 rows in the table, all at the same price. 2X and up apply only to men and women, so that's 8 more rows, which makes 30 rows total just for the crewneck shirts. Then come the color options: there are 32 different colors available for them. If I represented each and every size in every color, that would be 960 rows for the crewneck shirts alone, mostly HIGHLY repeated data with just one minor change per row.
I thought about it and figured it's best to treat the items in the table as actual items in a stock room, because THEY'RE REALLY THERE in the stock room: you don't have just one box of shirts with a button on the side you can punch to turn it into any size or color; you have to deal with the actual shirts and the tedious task of putting them somewhere. So I decided against getting outrageous with a bunch of foreign keys and indexes; besides, that gets just as tedious, and you wind up representing just as much data but with a lot more tables, when you could have just put the data they link to in the first table.
If I take just the other 3 kinds of shirts and apply that same logic with all the colors and sizes, those 4 shirts alone come to 3,840 rows, and counting the other shirts I haven't included yet, I'm looking at roughly 10,000 rows of data all in one table. This data will grow over time, and I'm wondering what it might turn into as I try to keep it all organized. So I figured maybe the best approach would be to break it down like they do in an actual retail store, which is to separate the departments into men, women, boys, girls, and babies. That way I have 5 separate tables that are only queried when the user decides to "go to that department," so a man who wants men's shirts doesn't have 7,000+ rows of extra data present that don't even apply to what he's looking for.
Would this be a better way of setting it up? Or would it be better to keep it all in one gigantic table and just query the men's shirts from it in the PHP for the men's section, and the same for women and kids?
My next issue is all the color options that may be available. As I said before, some shirts will have as few as 4 and some as many as 32, so some of those are enough data to form a table all on their own; I could really have a separate table for every kind of shirt. I'll be using a query in PHP to populate my items from the tables so I don't have to write so much HTML and JavaScript. That way I can do something like SELECT * FROM shirts WHERE type = 'men' and it will take all the men's shirts and auto-populate the markup for each one; as I add things to and remove things from the tables, the page updates automatically. I already have an idea for HOW I'm going to do that, but I can only think so far ahead because I haven't decided on a good way to set up the tables, which is what I'd have to structure the queries around.
For example, having all the color options of each shirt in the same table versus breaking them out into other tables with foreign keys linking to them would be two totally different ways of querying, so I'm stuck on this and don't really know where to go with it. Any suggestions?

Typically retail organization is by SKU (stock keeping unit). Department and color are attributes of a garment, not the way you identify the garment for the purpose of accounting or stocking.
CREATE TABLE Skus (
    sku           BIGINT UNSIGNED PRIMARY KEY,
    description   TEXT,
    department    VARCHAR(10) NOT NULL,
    color         VARCHAR(10) NOT NULL,
    qty_in_stock  INT UNSIGNED NOT NULL DEFAULT 0,
    unit_price    NUMERIC(9,2) NOT NULL,
    FOREIGN KEY (department) REFERENCES Departments(department),
    FOREIGN KEY (color) REFERENCES Colors(color)
);
This is better than splitting into five tables, because:
You can quickly get a sum of the total value of all your stock.
You can switch the department of a given SKU easily.
When someone buys a few garments, their order line items reference a single table instead of five different ones (a foreign key on the line items can only reference one table, so the five-table design breaks down here).
There are lots of other examples of tasks that are easier if similar entities are stored in one table.
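For instance, here is a sketch of two such queries against the Skus table above (the 'men' department value is just an illustration):

-- Total value of all stock, in one pass over one table:
SELECT SUM(qty_in_stock * unit_price) AS total_stock_value
FROM Skus;

-- A department page is just a filter, not a separate table:
SELECT sku, description, color, unit_price
FROM Skus
WHERE department = 'men';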

I know you don't want to break it out into separate tables, but I think going the multiple-table route would be best. However, I don't think it is as bad as you fear. My suggestion would be the following; obviously you'll want to change the names of the fields, but this is a quick representation:
Shirts
- id (primary key)
- description
- men (Y/N)
- women (Y/N)
- boy (Y/N)
- girl (Y/N)
- toddlers (Y/N)
Sizes
- id (primary key)
- shirt_id (foreign key)
- Size
Colors
- id (primary key)
- shirt_id (foreign key)
- Color
Price
- id (primary key)
- shirt_id (foreign key)
- size_id (foreign key)
- price
Having these tables makes it so that you won't have to store all 10,000 rows in one single table and maintain them, but the data is still all there. Keeping your data separated into its proper place keeps you from replicating needless information.
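If it helps to see that sketch as actual DDL, here is a minimal version; the Y/N flags as CHAR(1) and the types and lengths are assumptions, not part of the original suggestion:

CREATE TABLE Shirts (
    id          INT PRIMARY KEY,
    description VARCHAR(100),
    men         CHAR(1) NOT NULL DEFAULT 'N',
    women       CHAR(1) NOT NULL DEFAULT 'N',
    boy         CHAR(1) NOT NULL DEFAULT 'N',
    girl        CHAR(1) NOT NULL DEFAULT 'N',
    toddlers    CHAR(1) NOT NULL DEFAULT 'N'
);
CREATE TABLE Sizes (
    id       INT PRIMARY KEY,
    shirt_id INT NOT NULL,
    size     VARCHAR(12) NOT NULL,
    FOREIGN KEY (shirt_id) REFERENCES Shirts(id)
);
CREATE TABLE Colors (
    id       INT PRIMARY KEY,
    shirt_id INT NOT NULL,
    color    VARCHAR(20) NOT NULL,
    FOREIGN KEY (shirt_id) REFERENCES Shirts(id)
);
CREATE TABLE Price (
    id       INT PRIMARY KEY,
    shirt_id INT NOT NULL,
    size_id  INT NOT NULL,
    price    NUMERIC(9,2) NOT NULL,
    FOREIGN KEY (shirt_id) REFERENCES Shirts(id),
    FOREIGN KEY (size_id)  REFERENCES Sizes(id)
);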
Want to pull all men's shirts?
SELECT * FROM Shirts WHERE men = 'Y'
To be honest, you should really have at least 5 or 6 tables: one or two containing the labels for sizes and colors (either one table containing them all, or one for each), and the other 4 containing the actual data. This will keep your data uniform across everything (example: 'Blue' vs 'blue'). You know what they say: there is more than one way to skin a cat.

You need to think about a database term called 'normalization'. Normalization means that everything has its place in the database and should not be listed twice, but reused as needed. The most common mistake people make is not asking or thinking about what will happen down the road: they put up a database that has next to no normalization, massive memory consumption due to large datatypes, no seeding done, and a completely inflexible design that comes at a great cost to change later because it was made without thinking of the future.
There are many levels of normalization, but the most useful thing is to walk through a simple example that demonstrates concepts you can apply to larger things later. This assumes you have access to SQL Server Management Studio (SSMS); HOWEVER, if you are using MySQL or Oracle the principles are still very similar, and the comment sections show what I am getting at. You can run this example yourself if you have SSMS: just paste it in and hit F5. If you don't, just read the comments, although these concepts are better seen in action than merely envisioned.
DECLARE @Everything TABLE (PersonID int, OrderID int, PersonName varchar(8), OrderName varchar(8));
INSERT INTO @Everything VALUES (1, 1, 'Brett', 'Hat'), (1, 2, 'Brett', 'Shirt'), (1, 3, 'Brett', 'Shoes'), (2, 1, 'John', 'Shirt'), (2, 2, 'John', 'Shoes');
-- Very basic normalization level, in that I did not even ATTEMPT to separate entities into different tables
-- for reuse. I just insert EVERYTHING as I get it, in one place. This is great for just getting off the ground
-- or testing things, but in the future you won't be able to change it easily, and if there is a lot of data it
-- is hard to move. As you keep adding more and more columns, inserts get slower because memory is required
-- for both the rows and the columns.
SELECT TOP 10 * FROM @Everything;
DECLARE @Person TABLE (PersonID int identity, PersonName varchar(8));
INSERT INTO @Person VALUES ('Brett'), ('John');
DECLARE @Orders TABLE (OrderID int identity, PersonID int, OrderName varchar(8));
INSERT INTO @Orders VALUES (1, 'Hat'), (1, 'Shirt'), (1, 'Shoes'), (2, 'Shirt'), (2, 'Shoes');
-- I now have two logical things stored in two logical places. If I want to relate them I can use the T-SQL
-- language to do so. I am now using less memory for the storage of the individual tables, and if one or the
-- other becomes too large I can deal with it in isolation. I also have a seeded identity (an ever-increasing
-- number) that I can use as a primary key to relate row position and for faster indexing.
SELECT *
FROM @Person p
JOIN @Orders o ON p.PersonID = o.PersonID;
DECLARE @TypeOfOrder TABLE (OrderTypeID int identity, OrderType varchar(8));
INSERT INTO @TypeOfOrder VALUES ('Hat'), ('Shirt'), ('Shoes');
DECLARE @OrderBridge TABLE (OrderID int identity, PersonID int, OrderType int);
INSERT INTO @OrderBridge VALUES (1, 1), (1, 2), (1, 3), (2, 2), (2, 3);
-- Wow, I have a lot more tables, but my ability to expand is now pretty flexible: I could add even MORE
-- products to the bridge table, or other tables I have not even thought of yet. Now that I have a bridge
-- table, I only ever have to list a product type ONCE; when someone orders it again I just add a bridge row
-- relating the person to the order, hence the name "bridge": on its own it serves nothing but relating two
-- different things to each other. This method takes more time to set up, but in the end you need fewer rows
-- in your database overall because you are REUSING data efficiently and effectively.
SELECT TOP 10 *
FROM @Person p
JOIN @OrderBridge o ON p.PersonID = o.PersonID
JOIN @TypeOfOrder t ON o.OrderType = t.OrderTypeID;

Related

Database wide table or referenced table with record type

I want to represent a vehicle (think car or truck) in a database. I have up to 62 pieces of information I'd like to store for each one. Examples: year, make, model, drive type, brake system, Mfr. body code, steering type, wheel base, etc. The values are IDs referencing a 3rd-party database that provides the labels for each ID. The provider has one table listing all makes, one table listing all "steering types", etc.
All vehicles will populate the year, make, and model columns. Almost no record (if any) will populate more than 10 columns. But if I looked at all vehicles, then every column would be used by at least one record.
One approach would be to have a single table that has 62 columns. Again most records will have NULL values in most columns.
Alternatively I can do something like this (ignoring indices and primary key for sake of example):
create table vehicles (
    id int identity(1,1),
    year int,
    make int,
    model int
)

create table constraints (
    id int identity(1,1),
    vehicleId int,        -- foreign key to vehicles.id
    constraintTypeId int, -- foreign key to constraintTypes.id
    value int
)

create table constraintTypes (
    id int identity(1,1),
    name nvarchar(200)    -- Example: "wheel base", "brake system", etc.
)
With this second method if a vehicle only stores 2 pieces of information (aside from year, make, model), then it would have 2 records in table constraints.
Users wish to have a page to view all applications. If I have a table with 62 columns, I'd need 62 joins in the query to get the labels. I could store the labels on the vehicle record to make retrieval faster, but then when labels change in the source data it might be slow to update my vehicles table.
At current there are over 12 million vehicle records, and the source data changes monthly (additions, deletes, and a few label changes).
Is it a better design to have more columns, even if most are always just NULL, or is the second approach better? How does one even calculate the best approach? Even though I have 62 columns, they are all valid for every vehicle, but for cataloging purposes most are left empty. For example, if a record should match any "1999 Dodge Viper" (regardless of steering type, body style, etc.), the user doesn't want to have to populate all 62 columns; they want to just see one record for "1999 Dodge Viper".
Your question is a specific case of the general issue of data anomalies and normalisation: https://en.wikipedia.org/wiki/Database_normalization
And there is no 'right' answer, although experience suggests there are 'better' and 'worse' answers. So, a question to help you with your planning:
Will the requirements ever change? E.g. will someone one day want to record the brake shoe type, or the driver's seat type? If so, what are the implications of your 62-column table becoming a 63 (or 99) column table? (In my mind this leads me towards your second method.)
Also remember that, thanks to views, the presentation of the data, even in the DB, does not have to match its storage. E.g. you can have well-normalised tables and a view that shows users 62 (or 63, or 99) columns.
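For example, a sketch of such a view over the normalized tables from the question (the attribute names pivoted out are illustrative):

-- Present normalized constraint rows as columns, without storing them that way:
CREATE VIEW vehicleSummary AS
SELECT v.id, v.year, v.make, v.model,
       MAX(CASE WHEN ct.name = 'wheel base'   THEN c.value END) AS wheelBase,
       MAX(CASE WHEN ct.name = 'brake system' THEN c.value END) AS brakeSystem
       -- ...one CASE per attribute users want to see...
FROM vehicles v
LEFT JOIN constraints c      ON c.vehicleId = v.id
LEFT JOIN constraintTypes ct ON ct.id = c.constraintTypeId
GROUP BY v.id, v.year, v.make, v.model;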

Joining same column from same table multiple times

I need to retrieve data from the same table multiple times, divided into different columns.
First table "PRODUCTS" has the following columns:
PROD_ID
PRO_TYPE_ID
PRO_COLOR_ID
PRO_WEIGHT_ID
PRO_PRICE_RANGE_ID
Second table "COUNTRY_TRANSLATIONS" has the following columns:
ATTRIBUTE_ID
ATT_LANGUAGE_ID
ATT_TEXT_ID
Third and last table "TEXT_TRANSLATIONS" has the following columns:
TRANS_TEXT_ID
TRA_TEXT
PRO_TYPE_ID, PRO_COLOR_ID, PRO_WEIGHT_ID and PRO_PRICE_RANGE_ID are all integers, and all of them appear in the ATTRIBUTE_ID column multiple times (depending on how many translations are available). ATT_TEXT_ID is then joined with TRANS_TEXT_ID from the TEXT_TRANSLATIONS table.
Basically I need to run a query that retrieves information from TEXT_TRANSLATIONS multiple times. Right now I get an error saying that the correlation is not unique.
The data is available in more than 20 languages, hence the need to work with integers for each of the attributes.
Any suggestion on how I should build up the query? Thank you.
Hopefully, you're on an RDBMS that supports CTEs (pretty much everything except MySQL), or you'll have to modify this to refer to the joined tables each time...
WITH Translations (attribute_id, text)
  as (SELECT c.attribute_id, t.tra_text
        FROM Country_Translations c
        JOIN Text_Translations t
          ON t.trans_text_id = c.att_text_id
       WHERE c.att_language_id = @languageId)  -- parameter: the desired language
SELECT Products.prod_id,
       Type.text,
       Color.text,
       Weight.text,
       Price_Range.text
  FROM Products
  JOIN Translations as Type
    ON Type.attribute_id = Products.pro_type_id
  JOIN Translations as Color
    ON Color.attribute_id = Products.pro_color_id
  JOIN Translations as Weight
    ON Weight.attribute_id = Products.pro_weight_id
  JOIN Translations as Price_Range
    ON Price_Range.attribute_id = Products.pro_price_range_id
Of course, personally I think the design of the localization table was botched in two ways:
1) Everything is in the same table (especially without an 'attribute type' column).
2) The language attribute is in the wrong table.
For 1), this is mostly going to be a problem because you now have to maintain system-wide uniqueness of all attribute values. I can pretty much guarantee that, at some point, you're going to run into 'duplicates'. Also, unless you've designed your ranges with a lot of free space, the data values for a given type are non-consecutive; if you're not careful, there is the potential for UPDATE statements being run over the wrong values, simply because the start and end of a given range belong to the same attribute but not every value in between does.
For 2), this is because a text can't be completely divorced from its language (and country 'locale'). From what I understand, there are pieces of text that are valid as written in multiple languages but mean completely different things when read.
You'd likely be better off storing your localizations in something similar to this (only the color tables are shown here; the rest are an exercise for the reader):
Color
=========
color_id -- autoincrement
cyan -- smallint
yellow -- smallint
magenta -- smallint
key -- smallint
-- assuming a CMYK palette; add other required attributes
Color_Localization
===================
color_localization_id -- autoincrement, but optional:
-- the tuple (color_id, locale_id) should be unique
color_id -- fk reference to Color.color_id
locale_id -- fk reference to locale table.
-- Technically this is also country dependent,
-- but you can start off with just language
color_name -- localized text
This should make it so that all attributes have their own set of IDs, and it ties the localized text directly to the thing being localized.
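As a rough DDL sketch (SQL Server flavour; the UNIQUE constraint enforces the one-translation-per-locale rule noted in the comments above):

CREATE TABLE Color (
    color_id INT IDENTITY(1,1) PRIMARY KEY,
    cyan     SMALLINT NOT NULL,
    magenta  SMALLINT NOT NULL,
    yellow   SMALLINT NOT NULL,
    [key]    SMALLINT NOT NULL  -- 'key' is a reserved word, hence the brackets
);
CREATE TABLE Color_Localization (
    color_localization_id INT IDENTITY(1,1) PRIMARY KEY,
    color_id   INT NOT NULL REFERENCES Color(color_id),
    locale_id  INT NOT NULL,      -- fk to a Locale table, not shown
    color_name NVARCHAR(100) NOT NULL,
    UNIQUE (color_id, locale_id)  -- one translation per color per locale
);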

Basic question: how to properly redesign this schema

I am hopping onto a project that sits on top of a SQL Server 2008 DB with what seems like an inefficient schema to me. However, I'm not an expert at anything SQL, so I am seeking guidance.
In general, the schema has tables like this:
ID | A | B
ID is a unique identifier
A contains text, such as animal names. There's very little variety; maybe 3-4 different values in thousands of rows. This could vary with time, but still a small set.
B is one of two options, but stored as text. The set is finite.
My questions are as follows:
Should I create another table for names contained in A, with an ID and a value, and set the ID as the primary key? Or should I just put an index on that column in my table? Right now, to get a list of A's, it does "select distinct(a) from table" which seems inefficient to me.
The table has a multitude of columns for properties of A. These could be Color, Age, Weight, etc. I would think these are better suited to a separate table with columns: ID, AnimalID, Property, Value. Each property is unique to the animal, though, and I'm not sure how that schema could enforce this (the current schema implies it, since each property is a column, so you can only have one value per property).
Right now the DB is easily readable by a human, but its size is growing fast and I feel like the design is inefficient. There are currently no indexes at all. As I said, I'm not a pro, but I will read more on the subject. The goal is to have a fast system. Thanks for your advice!
This sounds like a database that might represent a veterinary clinic.
If the table you describe represents the various patients (animals) that come to the clinic, then properties specific to them are probably best on the primary table. But since, as you say, column "A" contains a species name, it might be worthwhile to link that to a secondary table to save on the redundancy of storing those names:
For example:
Patients
--------
ID  Name  SpeciesID  Color        DOB         Weight
1   Spot  1          Black/White  2008-01-01  20

Species
-------
ID  Species
1   Cocker Spaniel
If your main table should be instead grouped by customer or owner, then you may want to add an Animals table and link it:
Customers
---------
ID  Name
1   John Q. Sample

Animals
-------
ID  CustomerID  SpeciesID  Name  Color        DOB         Weight
1   1           1          Spot  Black/White  2008-01-01  20
...
As for your original column B, consider converting it to a boolean (BIT) if you only need to store two states. Barring that, consider CHAR to store a fixed number of characters.
Like most things, it depends.
By having the animal names directly in the table, it makes your reporting queries more efficient by removing the need for many joins.
Going with something like 3rd normal form (having an ID/name table for the animals) makes your database smaller, but requires more joins for reporting.
Either way, make sure to add some indexes.
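For example, the reporting join and one useful index under the normalized layout might look like this (names follow the sketch above):

SELECT p.Name, s.Species, p.Color, p.DOB, p.Weight
FROM Patients p
JOIN Species s ON s.ID = p.SpeciesID;

CREATE INDEX IX_Patients_SpeciesID ON Patients (SpeciesID);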

Architecture of SQL tables

I am wondering whether it is more useful and practical (in terms of DB size) to create multiple SQL tables with two columns each (one column containing a foreign key and one column containing the data), or to merge it all into one table containing multiple columns. I am asking this because in my scenario one product holding the primary key might have applicable data for only one column while the other columns would be empty.
example a. one table
productID  productname  weight  no_of_pages
1          book         130     500
2          watch        50      null
3          ring         null    null
example b. three tables
productID  productname
1          book
2          watch
3          ring

productID  weight
1          130
2          50

productID  no_of_pages
1          500
The multi-table approach is more "normal" (in database terms) because it avoids columns that commonly store NULLs. It's also something of a pain in programming terms because you have to JOIN a bunch of tables to get your original entity back.
I suggest adopting a middle way. Weight seems to be a property of most products, if not all (indeed, a ring has a weight, even if small, and you'll probably want to know it for shipping purposes), so I'd leave that in the Products table. But number of pages applies only to a book, as do a slew of other unmentioned properties (author, ISBN, etc.). In this example, I'd use a Products table and a Books table. The Books table extends the Products table in a fashion similar to class inheritance in object-oriented programming.
All book-specific properties go into the Books table, and you join only Products and Books to get a complete description of a book.
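A sketch of that inheritance-style layout (the column types are assumptions):

CREATE TABLE Products (
    productID   INT PRIMARY KEY,
    productName VARCHAR(50) NOT NULL,
    weight      INT NULL  -- common to most products, so it stays here
);
CREATE TABLE Books (
    productID   INT PRIMARY KEY REFERENCES Products(productID),
    no_of_pages INT NOT NULL,
    author      VARCHAR(100) NULL,
    isbn        VARCHAR(20) NULL
);

-- Complete description of a book:
SELECT p.productID, p.productName, p.weight, b.no_of_pages, b.author, b.isbn
FROM Products p
JOIN Books b ON b.productID = p.productID;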
I think this all depends on how the tables will be used. Maybe your examples are oversimplifying things too much but it seems to me that the first option should be good enough.
You'd really use the second example if you're going to be doing extremely CPU intensive stuff with the first table and will only need the second and third tables when more information about a product is needed.
If you're going to need the information in the second and third tables most times you query the table, then there's no reason to join over every time and you should just keep it in one table.
I would suggest example a in case there is a defined set of attributes for a product, and example c if you need a variable number of attributes (new attributes keep coming every now and then):
example c
productID  productName
1          book
2          watch
3          ring

attrID  productID  attrType     attrValue
1       1          weight       130
2       1          no_of_pages  500
3       2          weight       50
The table structure you have shown in example b is not normalized - there will be separate id columns required in second and third tables, since productId will be an fk and not a pk.
It depends on how many rows you are expecting in your PRODUCTS table. I would say that it would not make sense to normalize your tables to 3NF in this case, because product name, weight, and no_of_pages each describe the product. If you had repeating data such as manufacturers, it would make more sense to normalize at that point.
Without knowing the background (data model), there is no way to tell which variant is more "correct". Both are fine in certain scenarios.
You want three tables, full stop. That's best because there's no chance of watches winding up with pages (no pun intended) and some books without. If you normalize, the server works for you. If you don't, you do the work instead, just not as well. Up to you.
I am asking this because in my scenario one product holding the primary key might have applicable data for only one column while the other columns would be empty.
That's always true of nullable columns. Here's the rule: a nullable column has an optional relationship to the key. A nullable column can always be, and usually should be, in a separate table where it can be non-null.

Generate unique ID to share with multiple tables SQL 2008

I have a couple of tables in a SQL 2008 server for which I need to generate unique IDs. I have looked at the "identity" column, but the IDs really need to be unique and shared across all the tables.
So if I have, say, five tables of the "asset infrastructure" flavour and I want a unique ID to run across them as a combined group, I need some sort of generator that looks at all five tables and issues the next ID that is not duplicated in any of them.
I know this could be done with some sort of stored procedure but I'm not sure how to go about it. Any ideas?
The simplest solution is to set your identity seeds and increment on each table so they never overlap.
Table 1: Seed 1, Increment 5
Table 2: Seed 2, Increment 5
Table 3: Seed 3, Increment 5
Table 4: Seed 4, Increment 5
Table 5: Seed 5, Increment 5
The identity column mod 5 will tell you which table the record is in. You will use up your identity space five times faster so make sure the datatype is big enough.
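A sketch of the first two tables (the payload column is invented for illustration):

CREATE TABLE Table1 (id BIGINT IDENTITY(1,5) PRIMARY KEY, payload VARCHAR(64));
CREATE TABLE Table2 (id BIGINT IDENTITY(2,5) PRIMARY KEY, payload VARCHAR(64));
-- Table1 gets ids 1, 6, 11, ...; Table2 gets 2, 7, 12, ...; id % 5 identifies the table.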
Why not use a GUID?
You could let them each have an identity that seeds from numbers far enough apart never to collide.
GUIDs would work but they're butt-ugly, and non-sequential if that's significant.
Another common technique is to have a single-column table with an identity that dispenses the next value each time you insert a record. If the values are pulled from a common sequence, it can also be useful to have a second column indicating which table each value was dispensed to.
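A sketch of that dispenser table (names are illustrative, not from the answer):

CREATE TABLE IdDispenser (
    id           BIGINT IDENTITY(1,1) PRIMARY KEY,
    dispensed_to VARCHAR(30) NULL  -- optional: which table received the value
);

-- Grab the next shared ID, then use it as the key in the target table:
INSERT INTO IdDispenser (dispensed_to) VALUES ('Table3');
SELECT SCOPE_IDENTITY() AS next_shared_id;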
You realize there are logical design issues with this, right?
Reading into the design a bit, it sounds like what you really need is a single table called "Asset" with an identity column, and then either:
a) 5 additional tables for the subtypes of assets, each with a foreign key to the primary key on Asset; or
b) 5 views on Asset that each select a subset of the rows and then appear (to users) like the 5 original tables you have now.
If the columns on the tables are all the same, (b) is the better choice; if they're all different, (a) is the better choice. This is a classic DB spin on the supertype / subtype relationship.
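A sketch of option (a), with one invented subtype name for illustration:

CREATE TABLE Asset (
    assetID INT IDENTITY(1,1) PRIMARY KEY
    -- columns shared by every asset go here
);
CREATE TABLE NetworkAsset (  -- one of the five subtype tables
    assetID INT PRIMARY KEY REFERENCES Asset(assetID)
    -- columns specific to this subtype go here
);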
Alternately, you could do what you're describing and recreate the IDENTITY functionality yourself with a stored proc that wraps INSERT access to all 5 tables. Note that you'll have to put a TRANSACTION around it if you want guarantees of uniqueness, and if this is a popular table, that might make it a performance bottleneck. If that's not a concern, a proc like that might take the form:
CREATE PROCEDURE InsertAsset_Table1
AS
BEGIN
    BEGIN TRANSACTION
    -- SELECT the minimum integer not already used in any of the five tables
    -- INSERT INTO Table1 with that ID
    COMMIT TRANSACTION -- or roll back on error, etc.
END
Again, SQL is highly optimized to help you out if you choose the patterns I mention above, and NOT optimized for this kind of thing (there's overhead in creating the transaction, AND you'll be issuing shared locks on all 5 tables while this process is going on). Compare that with the PK/FK method above, where SQL Server knows exactly how to do it without those locks, or the view method, where you're only inserting into one table.
I found this when searching on Google. I am facing a similar problem for the first time. I had the idea of having a dedicated ID table specifically to generate the IDs, but I was unsure whether that was considered OK design. So I just wanted to say THANKS for the confirmation; it looks like it is an adequate solution, although not ideal.
I have a very simple solution. It should be good for cases when the number of tables is small:
create table T1(ID int primary key identity(1,2), rownum varchar(64))
create table T2(ID int primary key identity(2,2), rownum varchar(64))
insert into T1(rownum) values('row 1')
insert into T1(rownum) values('row 2')
insert into T1(rownum) values('row 3')
insert into T2(rownum) values('row 1')
insert into T2(rownum) values('row 2')
insert into T2(rownum) values('row 3')
select * from T1
select * from T2
drop table T1
drop table T2
This is a common problem, for example when using a table of people (called PERSON, singular, please) where each person is categorized, for example doctors, patients, employees, nurses, etc.
It makes a lot of sense to create a table for each of these categories of people containing their specific information, like an employee's start date and salary, or a nurse's qualifications and registration number.
A patient, for example, may have many nurses and doctors who work on him, so a many-to-many table linking a patient to other people in the PERSON table facilitates this nicely. That table should carry some description of the relationship between these people, which leads us back to the categories of people.
Since a doctor and a patient could end up with the same primary key ID in their own tables, it becomes very useful to have a globally unique ID, or object ID (OID).
A good way to do this, as suggested, is to have a table designated to auto-increment the primary key. Perform an INSERT on that table first to obtain the OID, then use it for the new PERSON.
I like to go a step further. When things get ugly (some new developer gets his hands on the database, or even worse, a really old developer), it's very useful to add more meaning to the OID.
Usually this is done programmatically, not with the database engine, but if you use a BIGINT for all the primary key IDs then you have lots of room to prefix the number with a visually identifiable sequence. For example, all doctors' IDs could begin with 100, all patients' with 110, all nurses' with 120.
To that I would append, say, a Julian date or a Unix date+time, and finally append the auto-increment ID.
This would result in numbers like:
110,2455892,00000001
120,2455892,00000002
100,2455892,00000003
Since the Julian date 100 years from now is only 2492087, you can see that 7 digits will adequately store this value.
A BIGINT is a 64-bit (8-byte) signed integer with a range of -9.22x10^18 to 9.22x10^18 (-2^63 to 2^63 - 1). Notice the exponent is 18: that's 18 digits you have to work with.
Using this design you are limited to 100 million OIDs, 999 categories of people, and dates up to... well past the shelf life of your database, but I suspect that's good enough for most solutions.
The operations required to create an OID like this are all multiplication and division, which avoids all the gear-grinding of text manipulation.
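A sketch of that arithmetic, using the 3 + 7 + 8 digit layout above (variable names are illustrative):

DECLARE @category BIGINT = 110,      -- e.g. patients
        @julian   BIGINT = 2455892,  -- 7-digit Julian date
        @seq      BIGINT = 1;        -- auto-increment value

DECLARE @oid BIGINT = @category * 1000000000000000  -- shift left 15 digits
                    + @julian   * 100000000         -- shift left 8 digits
                    + @seq;

SELECT @oid                           AS oid,          -- 110245589200000001
       @oid / 1000000000000000        AS category,     -- 110
       (@oid / 100000000) % 10000000  AS julian_date,  -- 2455892
       @oid % 100000000               AS sequence_no;  -- 1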
The disadvantage is that INSERTs require more than a simple T-SQL statement, but the advantage is that when you are tracking down errant data, or even being clever in your queries, your OID visually tells you a lot more than a random number or, worse, an eyesore like a GUID.