Database design, for optimized access - sql

I'll jump straight into the problem. Currently I have a table like this:
id | model | CategoryId | etc...
My new requirement is to support multiple categories per product. I have two possible solutions in mind, and I would like to know what problems each design might create. I know that a product can have at most 6 categories, and I can't create a linker table to link product to category.
In the first design I would simply add CategoryN columns:
id | model | CategoryId1 | CategoryId2 | CategoryId3 | CategoryId4 | etc...
But this would make queries hideous,
id | model | CategoryId | etc...
My second approach is simply to add one product row per category:
id | model | CategoryId | etc...
1 | ABC | 1 | etc...
2 | ABC | 2 | etc...
3 | ABC | 3 | etc...
I think queries would be cleaner but not necessarily simpler.
Another aspect is that I am looking at the performance of the queries and it looks like the first approach would be better.
I hope this is clear enough.
Thanks for any suggestions.

The third option is a many-to-many table to link a model to a category:
MODEL_CATEGORIES
model_id (primary key, foreign key to MODEL table)
category_id (primary key, foreign key to CATEGORY table)
Your example data would resemble:
model_id category_id
----------------------
1 1
1 2
1 3
This means there's no need for a category_id column in the MODEL table.
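For what it's worth, here is a minimal sketch of that link table, assuming SQLite driven through Python's sqlite3 module. The table layout follows the answer above; the second model ("XYZ"), the category names, and the helper models_in_category are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
    CREATE TABLE model    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- One row per (model, category) pair; the composite key blocks duplicates.
    CREATE TABLE model_categories (
        model_id    INTEGER NOT NULL REFERENCES model(id),
        category_id INTEGER NOT NULL REFERENCES category(id),
        PRIMARY KEY (model_id, category_id)
    );

    -- Hypothetical sample data.
    INSERT INTO model VALUES (1, 'ABC'), (2, 'XYZ');
    INSERT INTO category VALUES (1, 'cat-1'), (2, 'cat-2'), (3, 'cat-3');
    INSERT INTO model_categories VALUES (1, 1), (1, 2), (1, 3), (2, 2);
""")

def models_in_category(category_id):
    """Return the names of all models linked to the given category."""
    rows = conn.execute(
        """SELECT m.name
             FROM model AS m
             JOIN model_categories AS mc ON mc.model_id = m.id
            WHERE mc.category_id = ?
            ORDER BY m.name""",
        (category_id,),
    )
    return [name for (name,) in rows]

print(models_in_category(2))  # ['ABC', 'XYZ']
```

The query stays the same whether a model has one category or six, which is the main thing the CategoryId1..CategoryIdN design gives up.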

I think you're after a many-to-many relationship here.
Basically, you have a model_to_categories table that matches model ids against category ids.

Can you make the CategoryID values bit-field "flags"?
If so, you could keep your performance high by using the single CategoryID field that you have now, keep your queries simple by not adding a bunch of new columns, and would have up to 32 categories for each product over time (assuming that CategoryID is an int).
So, your CategoryID values would be:
id | model | CategoryId | etc...
1 | ABC | 1 | etc...
2 | ABC | 2 | etc...
3 | ABC | 4 | etc...
4 | ABC | 8 | etc...
You would store the total of all of the CategoryIDs in the CategoryID column (for backward compatibility) and then would have to test the value of the field to find out if a specific category were "set". Incidentally, you can do that directly in your query as well.
It's not "best practice", which would really require a brand-new table and a lot of joining, but if you are looking for a way to shoe-horn something in there that will work, bit-field flags will do the trick.
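Testing a flag directly in the query could look roughly like this. It is only a sketch, assuming SQLite through Python's sqlite3 module; the flag constants, the product table name, the sample rows, and the helper models_with_flag are all invented for the example.

```python
import sqlite3

# Hypothetical flag values: each category owns one bit.
CATEGORY_A, CATEGORY_B, CATEGORY_C, CATEGORY_D = 1, 2, 4, 8

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, model TEXT, CategoryId INTEGER)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        (1, "ABC", CATEGORY_A | CATEGORY_C),               # stored as 5
        (2, "DEF", CATEGORY_B),                            # stored as 2
        (3, "GHI", CATEGORY_A | CATEGORY_B | CATEGORY_D),  # stored as 11
    ],
)

def models_with_flag(flag):
    """Return models whose CategoryId has the given bit set."""
    rows = conn.execute(
        "SELECT model FROM product WHERE (CategoryId & ?) <> 0 ORDER BY id",
        (flag,),
    )
    return [model for (model,) in rows]

print(models_with_flag(CATEGORY_A))  # ['ABC', 'GHI']
print(models_with_flag(CATEGORY_C))  # ['ABC']
```

One trade-off worth measuring before committing to this: a bitwise predicate like the one above generally cannot use an ordinary index on CategoryId, so lookups by category tend to scan the table.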

Related

Select unique combination of values (attributes) based on user_id

I have a table that has a user_id and a new record for each return reason for that user, as shown here:
| user_id | return_reason |
|--------- |-------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
| 4 | changed mind |
What I would like to do is generate a key for each applicable combination of values in a new table and apply that key to the user_id in another new table, effectively creating a many-to-many relationship. The result would look like so:
Dimension Table ->
| reason_id | return_reason |
|----------- |--------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
Fact Table ->
| user_id | reason_id |
|--------- |----------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
My first thought was to iterate through the table with a cursor, but this seems like a standard problem, so there is presumably a more efficient way to do it. Is there a specific name for this type of problem? I also thought about pivoting and unpivoting, but that didn't seem too clean either. Any help or reference to articles on how to handle this is appreciated.
The problem concerns data normalization and relational integrity. Your concept doesn't really make sense - Dimension table shows two different reasons with same ID and Fact table loses a record. Conventional schema for this many-to-many relationship would be three tables like:
Users table (info about users and UserID is unique)
Reasons table (info about reasons and ReasonID is unique)
UserReasons junction table (associates users with reasons - your existing table). Assuming a user could be associated with the same reason multiple times, you probably also need ReturnDate and OrderID_FK fields in UserReasons.
So you need to replace the reason description in the first table (UserReasons) with a ReasonID. Add a Long Integer number field ReasonID_FK to that table to hold the ReasonID key.
To build Reasons table based on current data, use DISTINCT:
SELECT DISTINCT return_reason INTO Reasons FROM UserReasons
In new table, rename return_reason field to ReasonDescription and add an autonumber field ReasonID.
Now run UPDATE action to populate ReasonID_FK field in UserReasons.
UPDATE UserReasons INNER JOIN Reasons ON UserReasons.return_reason = Reasons.ReasonDescription SET UserReasons.ReasonID_FK = Reasons.ReasonID
When all looks good, delete return_reason field.
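The same steps can be tried end to end on the sample data. The sketch below assumes SQLite through Python's sqlite3 module rather than the engine the statements above were written for, so the UPDATE uses a correlated subquery in place of UPDATE ... INNER JOIN, and the Reasons table is created explicitly instead of with SELECT ... INTO.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserReasons (
        user_id       INTEGER,
        return_reason TEXT,
        ReasonID_FK   INTEGER
    );
    INSERT INTO UserReasons (user_id, return_reason) VALUES
        (1, 'broken'), (2, 'changed mind'), (2, 'overpriced'),
        (3, 'changed mind'), (4, 'changed mind');

    -- Lookup table: one row per distinct reason, with a generated key.
    CREATE TABLE Reasons (
        ReasonID          INTEGER PRIMARY KEY AUTOINCREMENT,
        ReasonDescription TEXT NOT NULL UNIQUE
    );
    INSERT INTO Reasons (ReasonDescription)
        SELECT DISTINCT return_reason FROM UserReasons;

    -- Fill the foreign key by matching on the description text.
    UPDATE UserReasons
       SET ReasonID_FK = (SELECT ReasonID
                            FROM Reasons
                           WHERE ReasonDescription = UserReasons.return_reason);
""")

# Check the result by joining through the new key instead of the text column.
pairs = conn.execute("""
    SELECT ur.user_id, r.ReasonDescription
      FROM UserReasons AS ur
      JOIN Reasons AS r ON r.ReasonID = ur.ReasonID_FK
     ORDER BY ur.user_id, r.ReasonDescription
""").fetchall()
print(pairs)
```

If the join returns every original (user_id, reason) pair, the text column is safe to drop.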

Is it better to use a separate table to store a list of values, or include the value directly in the current table?

I have a jobs table that stores information such as title, department, and salary. I want the user to be able to create a job using a form that has fields for the aforementioned information, as well as a field for the job category. The category would be something like Retail or IT, for example.
I don't have any issues with the actual coding itself, but rather with the best way to design the database to store the information. So my question is this: should I create a separate categories table that stores each job category, along with an ID, so that the tables would look something like this
categories jobs
+----+---------------+ +----+---------------+-------------+--------+-------------+
| id | category | | id | title | department | salary | category_id |
+----+---------------+ +----+---------------+-------------+--------+-------------+
| 1 | Retail | | 1 | Retail | department1 | 10000 | 2 |
+----+---------------+ +----+---------------+-------------+--------+-------------+
| 2 | IT | | 2 | IT | department2 | 12000 | 1 |
+----+---------------+ +----+---------------+-------------+--------+-------------+
where category_id is a foreign key linking to the categories table,
or should I do something like this, where all the information is stored in a single table:
jobs
+----+---------------+-------------+--------+-------------+
| id | title | department | salary | category |
+----+---------------+-------------+--------+-------------+
| 1 | Retail | department1 | 10000 | IT |
+----+---------------+-------------+--------+-------------+
| 2 | IT | department2 | 12000 | Retail |
+----+---------------+-------------+--------+-------------+
Which is the better option? They both seem to achieve the same result, but what are the pros and cons of doing it either way, and which way would be the more preferred way of doing it?
In general, you want to store "entities" in separate tables. In this case, category is a separate entity from jobs.
Why do you want to do this?
There is only one row per category, so you don't have to worry about duplication -- and errors.
There may be additional information that you want to store, such as the creation date, abbreviation, who created it, and so on.
Properly declared foreign key constraints ensure that only valid categories are stored.
Categories may be shared across different tables, and a separate reference table ensures that the values are consistent.
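To make the foreign key point concrete, here is a small sketch, assuming SQLite through Python's sqlite3 module (where enforcement has to be switched on per connection). The job titles and the helper try_insert_job are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE categories (
        id       INTEGER PRIMARY KEY,
        category TEXT NOT NULL UNIQUE
    );
    CREATE TABLE jobs (
        id          INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        department  TEXT,
        salary      INTEGER,
        category_id INTEGER NOT NULL REFERENCES categories(id)
    );
    INSERT INTO categories VALUES (1, 'Retail'), (2, 'IT');
""")

def try_insert_job(title, category_id):
    """Insert a job; return False if the category does not exist."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO jobs (title, category_id) VALUES (?, ?)",
                (title, category_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert_job("Developer", 2))   # True: category 2 exists
print(try_insert_job("Mystery job", 99))  # False: no category 99
```

With a free-text category column in jobs, the second insert would have gone through, and a typo like "Retial" would quietly become a new category.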

Correct Database Design / Relationship

Below I have shown a basic example of my proposed database tables.
I have two questions:
Categories "Engineering", "Client" and "Vendor" will have exactly the same "Disciplines", "DocType1" and "DocType2", does this mean I have to enter these 3 times over in the "Classification" table, or is there a better way? Bear in mind there is the "Vendor" category that is also covered in the classification table.
In the "Documents" table I have shown "category_id" and "classification_id". I'm not sure if this will depend on the answer to the first question, but is "category_id" necessary, or should I just use a JOIN to filter the category based on the classification_id?
Thank you in advance.
Table: Category
id | name
---|-------------
1 | Engineering
2 | Client
3 | Vendor
4 | Commercial
Table: Discipline
id | name
---|-------------
1 | Electrical
2 | Instrumentation
3 | Proposals
Table: DocType1
id | name
---|-------------
1 | Specifications
2 | Drawings
3 | Lists
4 | Tendering
Table: Classification
id | category_id | discipline_id | doctype1_id | doctype2
---|-------------|---------------|-------------|----------
1 | 1 | 1 | 2 | 00
2 | 1 | 1 | 2 | 01
3 | 2 | 1 | 2 | 00
4 | 4 | 3 | 4 | 00
Table: Documents
id | title | doc_number | category_id | classification_id
---|-----------------|------------|-------------|-------------------
1 | Electrical Spec | 0001 | 1 | 1
2 | Electrical Spec | 0002 | 2 | 3
3 | Quotation | 0003 | 3 | 4
From what you've provided, it looks like we have three simple lookup tables: category, discipline, and doctype1. The part that's not intuitively obvious to me and may also be causing confusion on your end, is that the last two tables are both serving as cross-references of the lookup tables. The classification table in particular seems like it might be out of place. If there are only certain combinations of category, discipline, and doctype that would ever be valid, then the classification table makes sense and the right thing to do would be to look up that valid combination by way of the classification ID from the document table. If this is not the case, then you would probably just want to reference the category, discipline, and document type directly from the document table.
In your example, the need to make this distinction is highlighted by the fact that the document table has a reference to the classification table and a reference to the category table. However, the row that is looked up in the classification table also references a category ID. This is not only redundant but also opens the door to conflicting category IDs.
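The sample data in the question already shows such a conflict. The sketch below loads the Classification and Documents rows exactly as posted (assuming SQLite through Python's sqlite3 module; the column types are guesses) and looks for documents whose own category_id disagrees with the one on their classification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Classification (
        id INTEGER PRIMARY KEY, category_id INTEGER,
        discipline_id INTEGER, doctype1_id INTEGER, doctype2 TEXT
    );
    CREATE TABLE Documents (
        id INTEGER PRIMARY KEY, title TEXT, doc_number TEXT,
        category_id INTEGER, classification_id INTEGER
    );
    INSERT INTO Classification VALUES
        (1, 1, 1, 2, '00'), (2, 1, 1, 2, '01'),
        (3, 2, 1, 2, '00'), (4, 4, 3, 4, '00');
    INSERT INTO Documents VALUES
        (1, 'Electrical Spec', '0001', 1, 1),
        (2, 'Electrical Spec', '0002', 2, 3),
        (3, 'Quotation',       '0003', 3, 4);
""")

# Documents whose category differs from their classification's category.
conflicts = conn.execute("""
    SELECT d.id, d.category_id, c.category_id
      FROM Documents AS d
      JOIN Classification AS c ON c.id = d.classification_id
     WHERE d.category_id <> c.category_id
""").fetchall()
print(conflicts)  # [(3, 3, 4)]
```

Document 3 is filed under category 3 (Vendor), but its classification row says category 4 (Commercial). Dropping Documents.category_id and always joining through Classification removes the possibility of that mismatch.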
I hope this helps.

Organizing & normalising RSS Feed categories data

I am having trouble normalising data from an RSS feed into a database.
Each post would have id and categories.
The problem I am having is that categories is a list which is not predefined in size. By 1NF I should split a list up such that each column only has atomic data:
+----+----------+
| id | name |
+----+----------+
| 1 | flying |
| 2 | swimming |
| 3 | throwing |
| 4 | sleeping |
| 5 | etc |
+----+----------+
However, blog posts can have more than one category tagged. This means that the posts table can have a list of ids of the categories tagged.
Alternatively, the categories table can have two ids:
+----+--------+----------+
| id | postId | name |
+----+--------+----------+
| 1 | 1 | flying |
| 2 | 1 | swimming |
| 3 | 1 | throwing |
| 4 | 2 | flying |
| 5 | 2 | swimming |
| 6 | 2 | etc |
+----+--------+----------+
And the posts table id will reference the postId column. However, there is repeated data, which is not good.
Lastly, another method I had thought of was to put all the categories in one table:
+----+--------+----------+----------+----------+-----+
| id | flying | swimming | throwing | sleeping | etc |
+----+--------+----------+----------+----------+-----+
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 1 |
+----+--------+----------+----------+----------+-----+
1s representing present and 0s representing absent, the id in the posts table references id. This method would not have any repeated data. However, categories from blogs can be created at will, making it hard to maintain such a table as I would need to update it every time there is a new category.
How do I put my database in 3NF, eliminating repetition while keeping it maintainable?
TL;DR "Repeated data" is a bugbear. Learn about design and normalization. Start with rows/tables that make clear straightforward relevant statements about an arbitrary situation. So far all you need is:
-- [id] identifies a post with ...
Post(id, ...)
-- post [id] is tagged [name]
Post_Category(id, name)
"there is repeated data, which is not good"
What exactly do you think "repeated data" is? And why exactly do you think it's "not good"?
There is nothing intrinsically bad about having the same value appear multiple times as a column of a row or part of a value for a column of a row. What matters is whether rows in tables say overlapping things about a situation in certain ways.
Normalization replaces a table by projections of it that join back to it. That means that it replaces tables whose rows say (ie have predicate) "some stuff AND other stuff" about column values by tables whose rows say "some stuff" and "other stuff" separately. Having "AND"s in such a row/table meaning isn't always bad. When there's only one AND, normalization says to decompose to a particular pair of tables exactly when no shared column set always holds a unique set of values in either of the two tables.
"put all the categories in one table"
Although there is nothing about such a design that would cause normalization to decompose it, your last table is a "bad" design. (Sometimes this kind of design with repeated similar columns is said to violate some notion of "1NF" or "normalization", but that is a misconception.) E.g. its rows say "(post [id] is tagged 'flying' AND [flying] = 1 OR post [id] is not tagged 'flying' AND [flying] = 0) AND (post [id] is tagged 'swimming' AND [swimming] = 1 OR post [id] is not tagged 'swimming' AND [swimming] = 0) AND ..." when instead we could just have a table Post_Category with rows saying "post [id] is tagged [name]". E.g. we cannot write queries that ask about all categories without mentioning all categories explicitly. E.g. if we add a new category then we must add a new column to the table, and if we want our past queries about all categories to keep meaning the same thing, we must also add the new column to each of them.
PS It's not clear why you introduced ids. There are reasons we do so, but you should do it for a reason. (Normalization does not introduce ids.) Eg introducing post ids if posts are not uniquely identifiable by other information we want to record.
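As an illustration of the Post and Post_Category design suggested in this answer, here is a minimal sketch assuming SQLite through Python's sqlite3 module. The composite primary key and the helper posts_per_category are additions for the example; the sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- [id] identifies a post
    CREATE TABLE Post (id INTEGER PRIMARY KEY);

    -- post [id] is tagged [name]
    CREATE TABLE Post_Category (
        id   INTEGER NOT NULL REFERENCES Post(id),
        name TEXT    NOT NULL,
        PRIMARY KEY (id, name)
    );

    INSERT INTO Post VALUES (1), (2);
    INSERT INTO Post_Category VALUES
        (1, 'flying'), (1, 'swimming'), (1, 'throwing'),
        (2, 'flying'), (2, 'swimming'), (2, 'etc');
""")

def posts_per_category():
    """Count posts for every category without naming any category."""
    return conn.execute(
        "SELECT name, COUNT(*) FROM Post_Category GROUP BY name ORDER BY name"
    ).fetchall()

print(posts_per_category())
# [('etc', 1), ('flying', 2), ('swimming', 2), ('throwing', 1)]

# A brand-new category is just a new row; the query above is unchanged.
conn.execute("INSERT INTO Post_Category VALUES (2, 'sleeping')")
print(posts_per_category())
```

With the one-column-per-category table, both the schema and the query would have needed editing to pick up 'sleeping'.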

Ways to implement tags - pros and cons of each

Using SO as an example, what is the most sensible way to manage tags if you anticipate they will change often?
Way 1: Seriously denormalized (comma delimited)
table posts
+--------+-----------------+
| postId | tags |
+--------+-----------------+
| 1 | c++,search,code |
Here tags are comma delimited.
Pros: Tags are retrieved at once with a single select query. Updating the tags of a post is simple and cheap.
Cons: Extra parsing on tag retrieval, difficult to count how many posts use which tags.
(alternatively, if limited to something like 5 tags)
table posts
+--------+-------+-------+-------+-------+-------+
| postId | tag_1 | tag_2 | tag_3 | tag_4 | tag_5 |
+--------+-------+-------+-------+-------+-------+
| 1 | c++ |search | code | | |
Way 2: "Slightly normalized" (separate table, no intersection)
table posts
+--------+-------------------+
| postId | title |
+--------+-------------------+
| 1 | How do u tag? |
table taggings
+--------+---------+
| postId | tagName |
+--------+---------+
| 1 | C++ |
| 1 | search |
Pros: Easy to see tag counts (count(*) from taggings where tagName='C++').
Cons: tagName will likely be repeated many, many times.
Way 3: The cool kid's (normalized with intersection table)
table posts
+--------+---------------------------------------+
| postId | title |
+--------+---------------------------------------+
| 1 | Why is a raven like a writing desk? |
table tags
+--------+---------+
| tagId | tagName |
+--------+---------+
| 1 | C++ |
| 2 | search |
| 3 | foofle |
table taggings
+--------+---------+
| postId | tagId |
+--------+---------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
Pros:
No repeating tag names.
More girls will like you.
Cons: More expensive to change tags than way #1.
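For reference, Way 3 could be set up and queried roughly as follows. This is a sketch assuming SQLite through Python's sqlite3 module; the second post's tagging row and the helper names are hypothetical, the rest mirrors the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (postId INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags  (tagId  INTEGER PRIMARY KEY, tagName TEXT NOT NULL UNIQUE);
    CREATE TABLE taggings (
        postId INTEGER NOT NULL REFERENCES posts(postId),
        tagId  INTEGER NOT NULL REFERENCES tags(tagId),
        PRIMARY KEY (postId, tagId)
    );
    INSERT INTO posts VALUES
        (1, 'Why is a raven like a writing desk?'),
        (2, 'How do u tag?');
    INSERT INTO tags VALUES (1, 'C++'), (2, 'search'), (3, 'foofle');
    -- Post 2 tagged C++ is a made-up extra row so the counts differ.
    INSERT INTO taggings VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

def tag_counts():
    """Number of posts per tag, including tags that are never used."""
    return conn.execute("""
        SELECT t.tagName, COUNT(tg.postId)
          FROM tags AS t
          LEFT JOIN taggings AS tg ON tg.tagId = t.tagId
         GROUP BY t.tagId, t.tagName
         ORDER BY t.tagId
    """).fetchall()

def tags_for_post(post_id):
    """Tag names attached to one post, which costs one join in this design."""
    rows = conn.execute("""
        SELECT t.tagName
          FROM taggings AS tg
          JOIN tags AS t ON t.tagId = tg.tagId
         WHERE tg.postId = ?
         ORDER BY t.tagId
    """, (post_id,))
    return [name for (name,) in rows]

print(tag_counts())      # [('C++', 2), ('search', 1), ('foofle', 1)]
print(tags_for_post(1))  # ['C++', 'search', 'foofle']
```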
These solutions are called MySQLicious, Scuttle and Toxi.
This article compares benefits and drawbacks of each.
I would argue that there is a fourth solution which is a variation on your third solution:
Create Table Posts
(
id ...
, title ...
)
Create Table Tags
(
name varchar(30) not null primary key
, ...
)
Create Table PostTags
(
PostId ...
, TagName varchar(30) not null
, Constraint FK_PostTags_Posts
Foreign Key ( PostId )
References Posts( Id )
, Constraint FK_PostTags_Tags
Foreign Key ( TagName )
References Tags( Name )
On Update Cascade
On Delete Cascade
)
Notice that I'm using the tag name as the primary key of the Tags table. In this way, you can filter on certain tags without the extra join to the Tags table itself. In addition, if you change a tag name, it will update the names in the PostTags table. If changing a tag name is a rare occurrence, then this shouldn't be a problem. If changing a tag name is a common occurrence, then I would go with your third solution where you use a surrogate key to reference the tag.
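Here is a runnable sketch of that natural-key variant, assuming SQLite through Python's sqlite3 module. The column types for the elided parts, the composite primary key on PostTags, and the sample rows are assumptions for the example, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # cascades need FK enforcement on
conn.executescript("""
    CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL);
    CREATE TABLE Tags  (Name VARCHAR(30) NOT NULL PRIMARY KEY);
    CREATE TABLE PostTags (
        PostId  INTEGER NOT NULL REFERENCES Posts(Id),
        TagName VARCHAR(30) NOT NULL
                REFERENCES Tags(Name) ON UPDATE CASCADE ON DELETE CASCADE,
        PRIMARY KEY (PostId, TagName)
    );
    INSERT INTO Posts VALUES (1, 'How do u tag?');
    INSERT INTO Tags VALUES ('C++'), ('search');
    INSERT INTO PostTags VALUES (1, 'C++'), (1, 'search');
""")

def tags_for_post(post_id):
    """No join to Tags needed: the name is stored in PostTags itself."""
    rows = conn.execute(
        "SELECT TagName FROM PostTags WHERE PostId = ? ORDER BY TagName",
        (post_id,),
    )
    return [name for (name,) in rows]

# Renaming the tag in Tags rewrites every matching PostTags row.
with conn:
    conn.execute("UPDATE Tags SET Name = 'cpp' WHERE Name = 'C++'")
print(tags_for_post(1))  # ['cpp', 'search']
```

The rename touches one row in Tags plus every PostTags row that uses it, which is why this variant suits tags that rarely change names.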
I personally favour solution #3.
I don't agree that solution #1 is easier to maintain.
Think of the situation where you have to change the name of a tag.
Solution #1:
UPDATE posts SET tags = REPLACE(tags, 'oldname', 'newname') WHERE tags LIKE '%oldname%'
Solution #3:
UPDATE tags SET tagName = 'newname' WHERE tagName = 'oldname'
The first one is way heavier.
Also, you have to deal with the commas when deleting tags (OK, it's easily done, but it's still more difficult than just deleting one row in the taggings table).
As for solution #2... it is neither fish nor fowl.
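There is a second problem with renaming in the comma-delimited design: a plain REPLACE also rewrites tags that merely contain the old name. The sketch below shows this, assuming SQLite through Python's sqlite3 module; the second post and all helper names are hypothetical, and the "delimited" variant is just one possible workaround, not a recommendation.

```python
import sqlite3

def make_db():
    """Fresh in-memory copy of the comma-delimited posts table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE posts (postId INTEGER PRIMARY KEY, tags TEXT NOT NULL);
        INSERT INTO posts VALUES
            (1, 'c++,search,code'),
            (2, 'codegolf,python');
    """)
    return conn

def rename_naive(conn, old, new):
    # Substring replace: also hits tags that only contain the old name.
    conn.execute(
        "UPDATE posts SET tags = REPLACE(tags, ?, ?) "
        "WHERE tags LIKE '%' || ? || '%'",
        (old, new, old),
    )

def rename_delimited(conn, old, new):
    # Pad with commas so only whole tags match, then trim the padding.
    # Does not handle tag names containing LIKE wildcards (% or _).
    conn.execute(
        """UPDATE posts
              SET tags = TRIM(REPLACE(',' || tags || ',',
                                      ',' || ? || ',',
                                      ',' || ? || ','), ',')
            WHERE ',' || tags || ',' LIKE '%,' || ? || ',%'""",
        (old, new, old),
    )

def all_tags(conn):
    return [t for (t,) in conn.execute("SELECT tags FROM posts ORDER BY postId")]

naive = make_db()
rename_naive(naive, "code", "snippet")
print(all_tags(naive))   # ['c++,search,snippet', 'snippetgolf,python']

careful = make_db()
rename_delimited(careful, "code", "snippet")
print(all_tags(careful))  # ['c++,search,snippet', 'codegolf,python']
```

Either way every post row has to be examined, whereas in solution #3 the rename is a single-row update on the tags table.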
I think that SO uses solution #1. I'd go with either #1 or #3.
One thing to consider is whether you have several kinds of things that can be tagged (e.g. adding tags to both posts and products). This may affect the database design.
Well, I had the same doubt, and I adopted the third solution for my website. I know there is another way of dealing with this problem of variable-length tuples, which consists of storing columns as rows: the information identifying the tuple is repeated redundantly, and the varying values are stored one per row.
+--------+-------+-------------------------------------+
| postId | label | value |
+--------+-------+-------------------------------------+
| 1 | tag |C++ |
+--------+-------+-------------------------------------+
| 1 | tag |search |
+--------+-------+-------------------------------------+
| 1 | tag |code |
+--------+-------+-------------------------------------+
| 1 | title | Why is a raven like a writing desk? |
+--------+-------+-------------------------------------+
This is really bad but sometimes it's the only feasible solution, and it's very far from the relational approach.