I have two tables, User and Company.
Both of them have an inventory.
DataTables
Item
ItemId | Name
================
1 | Glass
4 | Wood
User
UId | Name
============
1 | Max
Company
CId | Name
==================
1 | EvilCorp
Inventory
RowId | UId | CId | ItemId | amount
=================================================
1 | 2 | Null | 4 | 10
2 | 23 | Null | 4 | 5
3 | Null | 1 | 1 | 7
4 | Null | 1 | 4 | 70
Let's say I have 500 users and 300 companies and each one has 20 inventory slots. I will then have 16000 null values in my Inventory table (6000 UId nulls + 10000 CId nulls).
I want a SQL query that returns this information.
Result
Owner | Item | Amount
===========================
Max | Wood | 10
EvilCorp | Glass | 7
EvilCorp | Wood | 40
My problem is that my Inventory table is bad because of all the nulls that will appear in CId when the record is for a User, and vice versa.
Do you know how to design a good table without huge and/or complex SQL queries?
You have Users, Companies and Owners. Users and Companies are Owners.
This is a common situation that can be put into tables various ways.
Every table holds the rows that make some statement true:
-- "user [OId] has ..."
User(OId,...)
    fk OId references Owner -- a user is an owner
-- "company [OId] has ..."
Company(OId,...)
    fk OId references Owner -- a company is an owner
-- "owner [OId] has ..."
Owner(OId,...)
-- "[OId] owns [amount] of item [ItemId]"
Inventory(OId,ItemId,amount)
    fk OId references Owner
    fk ItemId references Item
This comes from choosing tables that make plain statements about parts of your application. For variations read about subtables and supertables for subtypes and supertypes. See this answer or this one.
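As a rough sketch, the outline above could be written as the following DDL and query. The column types, the choice to keep the name in Owner, and the (OId, ItemId) primary key (one row per owner and item) are assumptions made for illustration; they are not given in the question.

```sql
-- Sketch only: types, the Owner.name column and the key choices are assumptions.
CREATE TABLE Owner (
    OId  integer PRIMARY KEY,
    name varchar(100) NOT NULL
);

-- "User" is quoted because USER is a reserved word in several databases.
CREATE TABLE "User" (
    OId integer PRIMARY KEY REFERENCES Owner (OId)   -- a user is an owner
);

CREATE TABLE Company (
    OId integer PRIMARY KEY REFERENCES Owner (OId)   -- a company is an owner
);

CREATE TABLE Item (
    ItemId integer PRIMARY KEY,
    Name   varchar(100) NOT NULL
);

CREATE TABLE Inventory (
    OId    integer NOT NULL REFERENCES Owner (OId),
    ItemId integer NOT NULL REFERENCES Item (ItemId),
    amount integer NOT NULL,
    PRIMARY KEY (OId, ItemId)
);

-- Owner | Item | Amount, with no nullable owner columns and no UNION.
SELECT o.name AS Owner, i.Name AS Item, inv.amount AS Amount
FROM Inventory inv
JOIN Owner o ON o.OId = inv.OId
JOIN Item  i ON i.ItemId = inv.ItemId
ORDER BY o.name, i.Name;
```

With this layout every Inventory row has exactly one owner reference, so the report needs only two plain joins.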
The project I'm working on is an application that lets you design data entry forms, and automagically generates a schema in an underlying PostgreSQL database to persist them as well as the browsing and editing UI.
The use case I've encountered this with is a store back-office database, but the app itself intends to be somewhat universal. The administrator creates the following entry forms with the given fields:
Customers
name (text box)
Items
name (text box)
stock (number field)
Order
customer (combo box selecting a customer)
order lines (a grid showing order lines)
OrderLine
item (combo box selecting an item)
count (number field)
When all this is done, the resulting database schema will be equivalent to this:
create table Customers(id serial primary key,
                       name varchar);
create table Items(id serial primary key,
                   name varchar,
                   stock integer);
create table Orders(id serial primary key);
create table OrderLines(id serial primary key,
                        count integer);
create table Links(id serial primary key,
                   fk1 integer references Customers(id),
                   fk2 integer references Items(id),
                   fk3 integer references Orders(id),
                   fk4 integer references OrderLines(id));
Links being a special table that stores all the relationships between entities; every row has (usually) two of the foreign keys set to a value, and the rest set to NULL. Whenever a new entry form is added to the application instance, a new foreign key referencing the table for this form is added to Links.
So, suppose our shop stocks some widgets, gizmos, and thingeys. A customer named Adam orders two widgets and three gizmos, and Betty orders four gizmos and five thingeys. The database will contain the following data:
Customers
/----+-------\
| ID | NAME |
| 1 | Adam |
| 2 | Betty |
\----+-------/
Items
/----+---------+-------\
| ID | NAME | STOCK |
| 1 | widget | 123 |
| 2 | gizmo | 456 |
| 3 | thingey | 789 |
\----+---------+-------/
Orders
/----\
| ID |
| 1 |
| 2 |
\----/
OrderLines
/----+-------\
| ID | COUNT |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
\----+-------/
Links
/----+------+------+------+------\
| ID | FK1 | FK2 | FK3 | FK4 |
| 1 | 1 | NULL | 1 | NULL |
| 2 | 2 | NULL | 2 | NULL |
| 3 | NULL | NULL | 1 | 1 |
| 4 | NULL | NULL | 1 | 2 |
| 5 | NULL | NULL | 2 | 3 |
| 6 | NULL | NULL | 2 | 4 |
| 7 | NULL | 1 | NULL | 1 |
| 8 | NULL | 2 | NULL | 2 |
| 9 | NULL | 2 | NULL | 3 |
| 10 | NULL | 3 | NULL | 4 |
\----+------+------+------+------/
(The tables also contain a bunch of timestamps for auditing and soft deletion but I don't think they're relevant here, they just make writing the SQL by the administrator that much messier. The management app is also used to implement a bunch of different use cases, but they're generally primarily data entry, master-detail views, and either scalar fields or selection boxes.)
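To make the cost concrete, here is a sketch of a query that lists who ordered how many of what, going through Links for every relationship. The aliases are made up for illustration, and the audit and soft-deletion columns mentioned above are ignored.

```sql
-- Sketch: customer, item and count per order line, resolved through Links.
-- Each relationship needs its own pass over Links plus an IS NOT NULL check
-- to pick rows of the right "kind".
SELECT c.name AS customer, i.name AS item, ol."count"
FROM OrderLines ol
JOIN Links l_order ON l_order.fk4 = ol.id          -- order <-> order line
                  AND l_order.fk3 IS NOT NULL
JOIN Links l_cust  ON l_cust.fk3 = l_order.fk3     -- customer <-> order
                  AND l_cust.fk1 IS NOT NULL
JOIN Customers c   ON c.id = l_cust.fk1
JOIN Links l_item  ON l_item.fk4 = ol.id           -- item <-> order line
                  AND l_item.fk2 IS NOT NULL
JOIN Items i       ON i.id = l_item.fk2
ORDER BY c.name, ol.id;
```

On the sample data this returns Adam with 2 widgets and 3 gizmos, and Betty with 4 gizmos and 5 thingeys. Note that nothing in the schema stops an order from being linked to two customers, or an order line to no item at all.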
When I had to write a join through this thing I grumbled about it to my coworker, who replied "well, using separate tables for each relationship is one way to do it, this is another..." Leaving aside the obvious-to-me ugliness of the above and the practical issues, I also have a nagging feeling this has to be a violation of some normal form, but it's been a while since college and I'm struggling to figure out which of the criteria apply here.
Is there something stronger than "well, that's just your opinion" I can use when critiquing this design?
Below I have shown a basic example of my proposed database tables.
I have two questions:
1. Categories "Engineering", "Client" and "Vendor" will have exactly the same "Disciplines", "DocType1" and "DocType2". Does this mean I have to enter these three times over in the "Classification" table, or is there a better way? Bear in mind that the "Vendor" category is also covered in the classification table.
2. In the "Documents" table I have shown "category_id" and "classification_id". I'm not sure if this will depend on the answer to the first question, but is "category_id" necessary, or should I just use a JOIN to filter on the category via the classification_id?
Thank you in advance.
Table: Category
id | name
---|-------------
1 | Engineering
2 | Client
3 | Vendor
4 | Commercial
Table: Discipline
id | name
---|-------------
1 | Electrical
2 | Instrumentation
3 | Proposals
Table: DocType1
id | name
---|-------------
1 | Specifications
2 | Drawings
3 | Lists
4 | Tendering
Table: Classification
id | category_id | discipline_id | doctype1_id | doctype2
---|-------------|---------------|-------------|----------
1 | 1 | 1 | 2 | 00
2 | 1 | 1 | 2 | 01
3 | 2 | 1 | 2 | 00
4 | 4 | 3 | 4 | 00
Table: Documents
id | title | doc_number | category_id | classification_id
---|-----------------|------------|-------------|-------------------
1 | Electrical Spec | 0001 | 1 | 1
2 | Electrical Spec | 0002 | 2 | 3
3 | Quotation | 0003 | 3 | 4
From what you've provided, it looks like we have three simple lookup tables: category, discipline, and doctype1. The part that's not intuitively obvious to me, and that may also be causing confusion on your end, is that the last two tables both serve as cross-references of the lookup tables. The classification table in particular seems like it might be out of place. If only certain combinations of category, discipline, and doctype would ever be valid, then the classification table makes sense, and the right thing to do would be to look up that valid combination by way of the classification ID from the document table. If this is not the case, then you would probably just want to reference the category, discipline, and document type directly from the document table.
In your example, the need to make this distinction is highlighted by the fact that the document table has a reference to the classification table and a reference to the category table, while the row that is looked up in the classification table also references a category ID. This is not only redundant but also opens the door to conflicting category IDs.
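A quick way to see the conflict risk is to compare the two category references directly. This is a sketch against the tables exactly as posted in the question:

```sql
-- Documents whose own category_id disagrees with the category of their classification.
SELECT d.id,
       d.title,
       d.category_id AS document_category,
       c.category_id AS classification_category
FROM Documents d
JOIN Classification c ON c.id = d.classification_id
WHERE d.category_id <> c.category_id;
```

On the sample rows this already flags document 3 ("Quotation"), which says category 3 itself while its classification row says category 4. Dropping Documents.category_id and joining through Classification would make such rows impossible.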
I hope this helps.
Suppose I have a User table and a Roles table.
What is the usual practice/pattern for adding the relationship?
Do I create an extra column in the User table for the RoleID or do people usually create a Relationships table like so:
Relationships Table
RelationshipID | UserID | RoleID |... any other relations a user might have
As for the last bit: as a user you might create an endless number of different types of things that all need to be related to you. Do you instead add the relationship to each individual table created for each individual thing? For example:
Pages Table
PageID | Title | Content | Author (UserID)
and so another table would also be similar to this:
Comments Table
CommentID | Comment | Author (UserID)
In this case, I would need to expand upon the Relationships table if I were to do it that way:
Relationships Table
RelationshipID | UserID | RoleID | CommentID
And I'd probably only want to fill in the UserID and CommentID, as this relationship is not for the Roles; that is governed by another entry. So, for example, the values for a comment relationship might be:
AUTO | 2 | NULL | 16
I could imagine a multi purpose Revisions table being handy...
Revisions Table
RevisionID | DateCreated | UserID | ActionTypeID | ModelTypeID | Status | RelatedItemID
---------------------------------------------------------------------------------------
1 | <Now> | 3 | 4 (Delete) | 6 (Page) | TRUE | 38
2 | <Now> | 3 | 4 (Delete) | 5 (Comment) | TRUE | 10
3 | <Now> | 3 | 1 (Add) | 5 (Comment) | FALSE | 10
but not for a general Relationships table...
Does this sound correct?
Edit after comments:
The comments stated that a relationships table should be created because the data model is many-to-many.
So let's take my previous example of my possible relationship table:
Relationships Table Old
RelationshipID | UserID | RoleID | CommentID... etc
Should it actually be something more like this:
Relationships Table New
RelationshipID | ItemID | LinkID | ItemType | LinkType | Status
---------------------------------------------------------------------------------
1 | 23(PageID) | 7(UserID) | ("Page") | ("User") | TRUE
2 | 22(CommentID) | 7(UserID) | ("Comment") | ("User") | TRUE
3 | 22(CommentID) | 23(PageID) | ("Comment") | ("Page") | TRUE
As an example, consider this hierarchical schema.
Assume all id fields are auto incrementing primary keys and that foreign keys are named by [parent_table_name]_id convention.
The problem
As soon as there are multiple companies in the database, then companies will share all primary key sequences between them.
For example, if there are two company rows, the customer_group table could look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
-------------------
But it should look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
This behavior should also be exhibited for customer and any other table in the tree that directly or indirectly references company.
Note that I will most likely make a second id column (named something like relative_id) for this purpose, keeping the unique id column intact, as this is really mostly for display purposes and how users will reference these data entities.
Now if this was just one level of hierarchy, it would be a relatively simple solution.
I could make a table (table_name, company_id, current_id) and a trigger procedure that fires before insert on any of the tables, incrementing the current id by 1 and setting the row's relative_id to that value.
It's trivial when the company_id is right there in the insert query.
But how about the tables that don't reference company directly?
Like the lowest level of the hierarchy in this example, workorder, which only references customer.
Is there a clean, reusable solution to climb the ladder all the way from 'customer_id' to ultimately retrieve the parenting company_id?
Going recursively up the hierarchy with SELECTs on each INSERT doesn't sound too appealing to me, performance wise.
I also do not like the idea of just adding a foreign key to company to each of these tables; the schema would get increasingly ugly with each additional table.
These are the only two solutions I can see, but I may not be looking in the right places.
The company shouldn't care what the primary key is if you're using generated keys. They're supposed to be meaningless; compared for equality and nothing else. I grumbled about this earlier, so I'm really glad to see you write:
Note that I will most likely make a second id column (named something
like relative_id) for this purpose, keeping the unique id column
intact, as this is really mostly for display purposes and how users
will reference these data entities.
You're doing it right.
Most of the time it doesn't matter what the ID is, so you can just give them whatever comes out of a sequence and not care about holes/gaps. If you're concerned about inter-company leakage (unlikely) you can obfuscate the IDs by using the sequence as an input to a pseudo-random generator. See the function Daniel Verité wrote in response to my question about this a few years ago, pseudo_encrypt.
There are often specific purposes for which you need perfectly sequential gapless IDs, like invoice numbers. For those you need to use a counter table and - yes - look up the company ID. Such ID generation is slow and has terrible concurrency anyway, so an additional SELECT with a JOIN or two on indexed keys won't hurt much. Don't go recursively up the schema with SELECTs though, just use a series of JOINs. For example, for an insert into workorder your key generation trigger on workorder would be something like the following (untested):
CREATE OR REPLACE FUNCTION workorder_id_tgfn() RETURNS trigger AS $$
BEGIN
    IF tg_op = 'INSERT' THEN
        -- Get a new ID, locking the company's counter row so no other
        -- transaction can add a workorder for that company until this one
        -- commits or rolls back.
        UPDATE workorder_ids
        SET next_workorder_id = next_workorder_id + 1
        WHERE company_id = (SELECT customer_group.company_id
                            FROM customer
                            INNER JOIN customer_group ON (customer.customer_group_id = customer_group.id)
                            INNER JOIN company ON (customer_group.company_id = company.id)
                            WHERE customer.id = NEW.customer_id)
        RETURNING next_workorder_id
        INTO NEW.id;
    END IF;
    -- A BEFORE ROW trigger must return the (possibly modified) row.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
For the UPDATE ... RETURNING ... INTO syntax see Executing a Query with a Single-Row Result.
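The trigger function above relies on a counter table and a trigger definition that are not shown. A minimal sketch of both follows; the column types and the trigger name workorder_id_tg are assumptions, and like the function itself this is untested.

```sql
-- One counter row per company. Starting at 0 means the first workorder gets 1,
-- because the trigger function returns the value after incrementing it.
CREATE TABLE workorder_ids (
    company_id        integer PRIMARY KEY REFERENCES company (id),
    next_workorder_id integer NOT NULL DEFAULT 0
);

-- Seed a counter for every existing company. New companies need a row too;
-- otherwise the UPDATE matches nothing and NEW.id is set to NULL.
INSERT INTO workorder_ids (company_id)
SELECT id FROM company;

-- Run the function before each inserted workorder row.
-- (hypothetical trigger name)
CREATE TRIGGER workorder_id_tg
    BEFORE INSERT ON workorder
    FOR EACH ROW
    EXECUTE PROCEDURE workorder_id_tgfn();
```

Per-company numbers repeat across companies, so they cannot live in a column that must be unique on its own. Assigning them to a separate column, such as the relative_id mentioned in the question, keeps the generated primary key intact.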
There can be gaps in normal sequences even if there's no multi-company problem. Observe:
CREATE TABLE demo (id serial primary key, blah text);
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
BEGIN;
INSERT INTO demo(blah) values ('bb');
ROLLBACK;
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
SELECT * FROM demo;
Result:
regress=# SELECT * FROM demo;
id | blah
----+------
1 | aa
3 | aa
"But it should look like this"
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
I think it should not; I think you want a many-to-many relationship. The customer_group table:
| id | name |
-------------
| 1 | n1 |
| 2 | n2 |
| 3 | n3 |
-------------
And then the customer_group_company table:
| group_id | company_id |
-------------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------------
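As a sketch, the two tables above could be declared like this. The column types are assumptions, and company is assumed to have an id primary key as described in the question.

```sql
-- Groups exist once, independent of any company.
CREATE TABLE customer_group (
    id   integer PRIMARY KEY,
    name varchar(100) NOT NULL
);

-- Junction table: which company uses which group.
CREATE TABLE customer_group_company (
    group_id   integer NOT NULL REFERENCES customer_group (id),
    company_id integer NOT NULL REFERENCES company (id),
    PRIMARY KEY (group_id, company_id)   -- each pairing at most once
);
```

The composite primary key is what allows group 1 to appear for both company 1 and company 2 while still preventing duplicate pairings.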
I've created a form in PHP that collects basic information. I have a list box that allows multiple items to be selected (e.g. housing, rent, food, water). If multiple items are selected, they are stored in a field called Needs, separated by commas.
I have created a report ordered by each person's needs. People who have only one need are sorted correctly, but people who have multiple needs are sorted by the exact string passed to the database (e.g. housing, rent, food, water), which is not what I want.
Is there a way to separate the multiple values in this field using SQL to count each need instance/occurrence as 1 so that there are no comma delimitations shown in the results?
Your database is not in the first normal form. A non-normalized database will be very problematic to use and to query, as you are actually experiencing.
In general, you should be using at least the following structure. It can still be normalized further, but I hope this gets you going in the right direction:
CREATE TABLE users (
user_id int,
name varchar(100)
);
CREATE TABLE users_needs (
need varchar(100),
user_id int
);
Then you should store the data as follows:
-- TABLE: users
+---------+-------+
| user_id | name |
+---------+-------+
| 1 | joe |
| 2 | peter |
| 3 | steve |
| 4 | clint |
+---------+-------+
-- TABLE: users_needs
+---------+----------+
| need | user_id |
+---------+----------+
| housing | 1 |
| water | 1 |
| food | 1 |
| housing | 2 |
| rent | 2 |
| water | 2 |
| housing | 3 |
+---------+----------+
Note how the users_needs table is defining the relationship between one user and one or many needs (or none at all, as for user number 4.)
To normalise your database further, you should also use another table called needs, as follows:
-- TABLE: needs
+---------+---------+
| need_id | name |
+---------+---------+
| 1 | housing |
| 2 | water |
| 3 | food |
| 4 | rent |
+---------+---------+
Then the users_needs table should just refer to a candidate key of the needs table instead of repeating the text.
-- TABLE: users_needs (instead of the previous one)
+---------+----------+
| need_id | user_id |
+---------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 1 | 2 |
| 4 | 2 |
| 2 | 2 |
| 1 | 3 |
+---------+----------+
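Put together, the three tables could be declared with keys and foreign keys roughly as follows. This is a sketch; the column types and constraints are assumptions beyond what is shown above.

```sql
CREATE TABLE users (
    user_id integer PRIMARY KEY,
    name    varchar(100) NOT NULL
);

CREATE TABLE needs (
    need_id integer PRIMARY KEY,
    name    varchar(100) NOT NULL UNIQUE   -- each need is listed once
);

CREATE TABLE users_needs (
    user_id integer NOT NULL REFERENCES users (user_id),
    need_id integer NOT NULL REFERENCES needs (need_id),
    PRIMARY KEY (user_id, need_id)         -- a user selects each need at most once
);
```

The PHP form would then insert one users_needs row per selected list box item instead of joining the selections into a single comma-separated string.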
You may also be interested in checking out the following Wikipedia article for further reading about repeating values inside columns:
Wikipedia: First normal form - Repeating groups within columns
UPDATE:
To fully answer your question, if you follow the above guidelines, sorting, counting and aggregating the data should then become straightforward.
To sort the result-set by needs, you would be able to do the following:
SELECT users.name, needs.name
FROM users
INNER JOIN users_needs ON (users_needs.user_id = users.user_id)
INNER JOIN needs ON (needs.need_id = users_needs.need_id)
ORDER BY needs.name;
You would also be able to count how many needs each user has selected, for example:
SELECT users.name, COUNT(users_needs.need_id) as number_of_needs
FROM users
LEFT JOIN users_needs ON (users_needs.user_id = users.user_id)
GROUP BY users.user_id, users.name
ORDER BY number_of_needs;
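The question also asks for each need to be counted once per occurrence. With the normalised needs and users_needs tables that is a single GROUP BY; this is a sketch, and the alias number_of_users is made up for illustration.

```sql
-- How many users selected each need; needs nobody selected show up with 0.
SELECT needs.name, COUNT(users_needs.user_id) AS number_of_users
FROM needs
LEFT JOIN users_needs ON (users_needs.need_id = needs.need_id)
GROUP BY needs.need_id, needs.name
ORDER BY number_of_users DESC, needs.name;
```

On the sample data above this gives housing 3, water 2, food 1 and rent 1.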
I'm a little confused by the goal. Is this a UI problem or are you just having trouble determining who has multiple needs?
The number of needs is the number of commas plus one, which you can get from the difference in length once the commas are removed:
Len([Needs]) - Len(Replace([Needs],',','')) + 1
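The same count can be written with LENGTH and REPLACE, which PostgreSQL, MySQL and SQLite all provide. A sketch, assuming a hypothetical table people that holds the comma-separated Needs column:

```sql
-- Number of commas + 1 = number of comma-separated entries.
-- For 'housing,rent,food,water' this is 23 - 20 + 1 = 4.
SELECT name,
       LENGTH(Needs) - LENGTH(REPLACE(Needs, ',', '')) + 1 AS number_of_needs
FROM people   -- hypothetical table name
ORDER BY number_of_needs;
```

Be aware that an empty string still counts as one need with this formula, so empty values may need to be filtered out first.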
Can you provide more information about the Sort you're trying to accomplish?
UPDATE:
I think these Oracle-based posts may have what you're looking for: post and post. The only difference is that you would probably be better off using the method I list above to find the number of comma-delimited pieces rather than doing the translate(...) that the author suggests. Hope this helps - it's Oracle-based, but I don't see .