Correct Database Design / Relationship - sql

Below I have shown a basic example of my proposed database tables.
I have two questions:
Categories "Engineering", "Client" and "Vendor" will have exactly the same "Disciplines", "DocType1" and "DocType2", does this mean I have to enter these 3 times over in the "Classification" table, or is there a better way? Bear in mind there is the "Vendor" category that is also covered in the classification table.
In the "Documents" table I have shown "category_id" and "classification_id", I'm not sure if the will depend on the answer to the first question, but is "category_id" necessary, or should I just be using a JOIN to allow me to filter the category based on the classification_id?
Thank you in advance.
Table: Category
id | name
---|-------------
1 | Engineering
2 | Client
3 | Vendor
4 | Commercial
Table: Discipline
id | name
---|-------------
1 | Electrical
2 | Instrumentation
3 | Proposals
Table: DocType1
id | name
---|-------------
1 | Specifications
2 | Drawings
3 | Lists
4 | Tendering
Table: Classification
id | category_id | discipline_id | doctype1_id | doctype2
---|-------------|---------------|-------------|----------
1 | 1 | 1 | 2 | 00
2 | 1 | 1 | 2 | 01
3 | 2 | 1 | 2 | 00
4 | 4 | 3 | 4 | 00
Table: Documents
id | title | doc_number | category_id | classification_id
---|-----------------|------------|-------------|-------------------
1 | Electrical Spec | 0001 | 1 | 1
2 | Electrical Spec | 0002 | 2 | 3
3 | Quotation | 0003 | 3 | 4

From what you've provided, it looks like we have three simple lookup tables: category, discipline, and doctype1. The part that's not intuitively obvious to me and may also be causing confusion on your end, is that the last two tables are both serving as cross-references of the lookup tables. The classification table in particular seems like it might be out of place. If there are only certain combinations of category, discipline, and doctype that would ever be valid, then the classification table makes sense and the right thing to do would be to look up that valid combination by way of the classification ID from the document table. If this is not the case, then you would probably just want to reference the category, discipline, and document type directly from the document table.
In your example, the need to make this distinction is illuminated by the fact that the document table has a referenc to the classification table and a references to the category table. However the row that is looked up in the classification table also references a category ID. This is not only redundant but also opens the door to the possibility of having conflicting category IDs.
I hope this helps.

Related

Element with the most votes for each combination of attributes

In my schema, a user can vote for different monsters that have different powers (eg lighting, fire) and different bodies.
Body is a polymorphic association, as it can be from different types of animals.
Here's the relevant pieces of the schema:
votes:
monster_id
power_id
body_id #polymorphic association
body_type #polymorphic association
For every combination of power and body with representation on the votes table, I want to find out the monsters that got the most votes.
Eg of a specific example:
--------------------------------------------------
| votes |
--------------------------------------------------
| monster_id| power_id | body_id | body_type |
--------------------------------------------------
| 1 | 1 | 1 | Body::Mammal |
| 2 | 1 | 1 | Body::Mammal |
| 2 | 1 | 1 | Body::Mammal |
| 11 | 2 | 11 | Body::Reptile |
| 11 | 2 | 11 | Body::Reptile |
| 22 | 2 | 11 | Body::Reptile |
--------------------------------------------------
Results I would like:
- ["For the combination (power_id: 1, body_id: 1, body_type: Body::Mammal), the monster with most votes is monster_id: 2",
"For the combination (power_id: 2, body_id: 11, body_type: Body::Reptile), the monster with most votes is monster_id: 11",
...]
I am using Rails 6 and postgres so I have the option to use ActiveRecord, for which I have a slight preference, but I realize this likely needs raw sql.
I understand the answer is very likely an extension of the one given in this question, where they do a similar thing with less attributes, but I can't seem to add the extra complexity needed to accommodate increased number of columns in play.
sql: select most voted items from each user
If I follow you correctly, you can use distinct on and aggregation:
select distinct on (body_id, power_id, body_type)
body_id, power_id, body_type, monster_id, count(*) cnt_votes
from votes
group by body_id, power_id, body_type, monster_id
order by body_id, power_id, body_type, count(*) desc

Organizing & normalising RSS Feed categories data

I am having trouble normalising data from a RSS Feed into a database.
Each post would have id and categories.
The problem I am having is that categories is a list which is not predefined in size. By 1NF I should split a list up such that each column only has atomic data:
+----+----------+
| id | name |
+----+----------+
| 1 | flying |
| 2 | swimming |
| 3 | throwing |
| 4 | sleeping |
| 5 | etc |
+----+----------+
However, blog posts can have more than one category tagged. This means that the posts table can have a list of ids of the categories tagged.
Alternatively, the categories table can have two ids:
+----+--------+----------+
| id | postId | name |
+----+--------+----------+
| 1 | 1 | flying |
| 2 | 1 | swimming |
| 3 | 1 | throwing |
| 4 | 2 | flying |
| 5 | 2 | swimming |
| 6 | 2 | etc |
+----+--------+----------+
And the posts table id will reference the postId column. However, there is repeated data, which is not good.
Lastly, another method I had thought of was to put all the categories in one table:
+----+--------+----------+----------+----------+-----+
| id | flying | swimming | throwing | sleeping | etc |
+----+--------+----------+----------+----------+-----+
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 1 |
+----+--------+----------+----------+----------+-----+
1s representing present and 0s representing absent, the id in the posts table references id. This method would not have any repeated data. However, categories from blogs can be created at will, making it hard to maintain such a table as I would need to update it every time there is a new category.
How do I put my database in 3NF, eliminating repetition while keeping it maintainable?
TL;DR "Repeated data" is a bugbear. Learn about design and normalization. Start with rows/tables that make clear straightforward relevant statements about an arbitrary situation. So far all you need is:
-- [id] identifies a post with ...
Post(id, ...)
-- post [id] is tagged [name]
Post_Category(id, name)
there is repeated data, which is not good
What exactly do you think "repeated data" is? And why exactly do you think it's "not good"?
There is nothing intrinsically bad about having the same value appear multiple times as a column of a row or part of a value for a column of a row. What matters is whether rows in tables say overlapping things about a situation in certain ways.
Normalization replaces a table by projections of it that join back to it. That means that it replaces tables whose rows say (ie have predicate) "some stuff AND other stuff" about column values by tables whose rows say "some stuff" and "other stuff" separately. Having "AND"s in such a row/table meaning isn't always bad. When there's only one AND, normalization says to decompose to a particular pair of tables exactly when no shared column set always holds a unique set of values in either of the two tables.
put all the categories in one table
Although there is nothing about such a design that would cause normalization to decompose it, your last table is a "bad" design. (Sometimes this kind of design with repeated similar columns is said to violate some notion of "1NF" or "normalization", but that is a misconception.) Eg its rows say "(post [id] is tagged 'flying' and [flying] = 1 OR post [id] is not tagged 'flying' AND [flying] = 0) AND (post [id] is tagged 'swimming' and [swimming] = 1 OR post [id] is not tagged 'swimming' AND [swimming] = 0) AND ..." when instead we could just have a table Post_Category with rows saying "post [id] is tagged [name]". Eg we cannot write queries that ask about all categories without mentioning all categories explicitly. Eg if we add a new category then we must add a new column to the table and then if we want our past queries re all categories to mean the same thing then they we must add the new column to still be referring to all categories.
PS It's not clear why you introduced ids. There are reasons we do so, but you should do it for a reason. (Normalization does not introduce ids.) Eg introducing post ids if posts are not uniquely identifiable by other information we want to record.

design for vehicle identification number (VIN)

I've designed a few Vehicle Identification Number (VIN) decoders for different OEMs. The thing about VIN numbers...despite being somewhat standardized, each OEM can overload the character position codes and treat them differently, add "extra" metadata (i.e. asterisks pointing to more data outside the VIN number), etc., etc. Despite all that, I've been able to build several different OEM VIN decoders, and now I'm trying to build a GM VIN decoder, and it is giving me a headache.
The gist of the problem is that GM treats the vehicle attributes section (position 4,5,6,7) differently depending on whether it is a truck or a car. Here is the breakdown:
GM Passenger Car VIN breakdown
GM Truck VIN breakdown
Normally what I do is design my own crude ETL process to import the data into an RDMBS - each table roughly correlates with the major VIN breakdown. For example, there will be a WMI table, EngineType table, ModelYear table, AssemblyPlant table, etc. Then I construct a View that joins on some contextual data that may or not be gleaned directly from the character codes in the VIN number itself (e.g. some vehicle types only have certain vehicle engines).
To look up a VIN is simply a matter of querying the VIEW with each major character code position breakdown of the VIN string. For example, an example VIN of 1FAFP53UX4A162757 breaks down like this in a different OEM's VIN structure:
| WMI | Restraint | LineSeriesBody | Engine | CheckDigit | Year | Plant | Seq |
| 123 | 4 | 567 | 8 | 9 | 10 | 11 | 12-17 |
---------------------------------------------------------------------------------
| 1FA | F | P53 | U | X | 4 | A | ... |
GM has thrown a wrench into this...depending on whether it is a car or truck, the character code positions mean different things.
Example of what I mean - each ASCII table below correlates somewhat to a SQL table. etc.. means there is a whole lot of other columnar data
Passenger Car
Here's an example of position 4,5 (corresponds to vehicle line/series). These really go together, the VIN source data doesn't really differentiate between position 4 and 5 despite the breakdown illustrated above.
| Code (45)| Line | Series | etc..
--------------------------------------
| GA | Buick | Lacrosse | etc..
..and position 6 corresponds to body style
| Code (6) | Style | etc..
--------------------------------------
| 1 | Coupe, 2-Door | etc..
Trucks
..but for trucks, the structure is completely different. Consider position 4 stands on its own as Grosse Vehicle Weight Restriction GVWR.
| Code (4) | GVWR | etc..
-------------------------------
| L | 6000 lbs | etc..
..and positions 5,6 (Chassis/Series) now mean something similar to position 4,5 of passenger car:
| Code (56) | Line | Series | etc..
---------------------------------------
| RV | Buick | Enclave | etc..
I'm looking for a crafty way to resolve this in the relational design. I would like to return a common structure when a VIN is decoded -- if possible (i.e. not returning a different structure for cars vs. trucks)
Based on your answer to my comment regarding if you can identify the type of vehicle by using other values, a possible approach could be to have a master table with the common fields and 2 detail tables, each one with the appropriate fields for either cars or trucks.
Approximately something like the following (here I am guessing WMI is the PK):
Master table
| WMI | Restraint | Engine | CheckDigit | Year | Plant | Seq |
| 123 | 4 | 8 | 9 | 10 | 11 | 12-17 |
Car detail table
| WMI | Veh Line | Series | Body Type |
| 123 | 2 | 3 | 4 |
Truck detail table
| WMI | GWVR | Chassis |Body Type |
| 123 | 7 | 8 | 9 |
Having this, you could use a unique select to retrieve the needed data like following:
Select *
From
(
Select M.*,
C.*,
Null GWVR,
Null Chassis,
Null Truck_Body_Type
From Master_Table M
Left Join Car_Table C
on M.WMI = C.WMI
Union
Select M.*,
Null Veh_Line,
Null Series,
Null Car_Body_Type
T.*
From Master_Table M
Left Join Truck_Table T
on M.WMI = T.WMI
)
As for DML SQL you would only need to control prior to insert or update sentences whether you have a car or a truck model.
Of course you would need to make sure that only one detail exists for each master row, either on the car detail table or on the truck detail table.
HTH
Why you do not define both of these rules for the decoding; only one will resolve a valid result.

How to store Goals (think RPG Quest) in SQL

Someone asked me today how they should store quest goals in a SQL database. In this context, think of an RPG. Goals could include some of the following:
Discover [Location]
Kill n [MOB Type]
Acquire n of [Object]
Achieve a [Skill] in [Skillset]
All the other things you get in RPGs
The best I could come up with is:
Quest 1-* QuestStep
QuestStep 1-* MobsToKill
QuestStep 1-* PlacesToFind
QuestStep 1-* ThingsToAcquire
QuestStep 1-* etc.
This seems a little clunky - Should they be storing a query of some description instead (or a formula or ???)
Any suggestions appreciated
User can embark on many quests.
One quest belongs to one user only (in this model).
One quest has many goals, one goal belongs to one quest only.
Each goal is one of possible goals.
A possible goal is an allowed combination of an action and an object of the action.
PossibleGoals table lists all allowed combinations of actions and objects.
Goals are ordered by StepNo within a quest.
Quantity defines how many objects should an action act upon, (kill 5 MOBs).
Object is a super-type for all possible objects.
Location, MOBType, and Skill are object sub-types, each with different properties (columns).
I would create something like this.
For the Quest table:
| ID | Title | FirstStep (Foreign key to GuestStep table) | etc.
The QuestStep table
| ID | Title | Goal (Foreign key to Goal table) | NextStep (ID of next QuestStep) | etc.
Ofcourse this is where the hard part start, how do we describe the goals? I'd say create one record for the goal in the Goal table and save each of the fields of the goal (I.E. how many mobs of what type to kill, what location to visit, etc.) in a GoalFields table, thus:
Goal table:
| ID | Type (type is one from an Enum of goal types) |
The GoalFields Table
| ID | Goal (Foreign key to goal) | Field | Value |
I understand that this can be a bit vague, so here is an example of what dat in the database could look like.
Quest table
| 0 | "Opening quest" | 0 | ...
| 1 | "Time for a Sword" | 2 | ...
QuestStep table
| 0 | "Go to the castle" | 0 | 1 | ...
| 1 | "Kill two fireflies" | 1 | NULL | ...
| 2 | "Get a sword" | 2 | NULL | ...
Goal table
| 0 | PlacesToFind |
| 1 | MobsToKill |
| 2 | ThingsToAcquire |
GoalFields table
| 0 | 0 | Place | "Castle" |
| 1 | 1 | Type | "firefly" |
| 2 | 1 | Amount | 2 |
| 3 | 2 | Type | "sword" |
| 4 | 2 | Amount | 1 |

Retrieve comma delimited data from a field

I've created a form in PHP that collects basic information. I have a list box that allows multiple items selected (i.e. Housing, rent, food, water). If multiple items are selected they are stored in a field called Needs separated by a comma.
I have created a report ordered by the persons needs. The people who only have one need are sorted correctly, but the people who have multiple are sorted exactly as the string passed to the database (i.e. housing, rent, food, water) --> which is not what I want.
Is there a way to separate the multiple values in this field using SQL to count each need instance/occurrence as 1 so that there are no comma delimitations shown in the results?
Your database is not in the first normal form. A non-normalized database will be very problematic to use and to query, as you are actually experiencing.
In general, you should be using at least the following structure. It can still be normalized further, but I hope this gets you going in the right direction:
CREATE TABLE users (
user_id int,
name varchar(100)
);
CREATE TABLE users_needs (
need varchar(100),
user_id int
);
Then you should store the data as follows:
-- TABLE: users
+---------+-------+
| user_id | name |
+---------+-------+
| 1 | joe |
| 2 | peter |
| 3 | steve |
| 4 | clint |
+---------+-------+
-- TABLE: users_needs
+---------+----------+
| need | user_id |
+---------+----------+
| housing | 1 |
| water | 1 |
| food | 1 |
| housing | 2 |
| rent | 2 |
| water | 2 |
| housing | 3 |
+---------+----------+
Note how the users_needs table is defining the relationship between one user and one or many needs (or none at all, as for user number 4.)
To normalise your database further, you should also use another table called needs, and as follows:
-- TABLE: needs
+---------+---------+
| need_id | name |
+---------+---------+
| 1 | housing |
| 2 | water |
| 3 | food |
| 4 | rent |
+---------+---------+
Then the users_needs table should just refer to a candidate key of the needs table instead of repeating the text.
-- TABLE: users_needs (instead of the previous one)
+---------+----------+
| need_id | user_id |
+---------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 1 | 2 |
| 4 | 2 |
| 2 | 2 |
| 1 | 3 |
+---------+----------+
You may also be interested in checking out the following Wikipedia article for further reading about repeating values inside columns:
Wikipedia: First normal form - Repeating groups within columns
UPDATE:
To fully answer your question, if you follow the above guidelines, sorting, counting and aggregating the data should then become straight-forward.
To sort the result-set by needs, you would be able to do the following:
SELECT users.name, needs.name
FROM users
INNER JOIN needs ON (needs.user_id = users.user_id)
ORDER BY needs.name;
You would also be able to count how many needs each user has selected, for example:
SELECT users.name, COUNT(needs.need) as number_of_needs
FROM users
LEFT JOIN needs ON (needs.user_id = users.user_id)
GROUP BY users.user_id, users.name
ORDER BY number_of_needs;
I'm a little confused by the goal. Is this a UI problem or are you just having trouble determining who has multiple needs?
The number of needs is the difference:
Len([Needs]) - Len(Replace([Needs],',','')) + 1
Can you provide more information about the Sort you're trying to accomplish?
UPDATE:
I think these Oracle-based posts may have what you're looking for: post and post. The only difference is that you would probably be better off using the method I list above to find the number of comma-delimited pieces rather than doing the translate(...) that the author suggests. Hope this helps - it's Oracle-based, but I don't see .