Rolling id based on foreign key in a hierarchical schema - sql

As an example, consider this hierarchical schema.
Assume all id fields are auto incrementing primary keys and that foreign keys are named by [parent_table_name]_id convention.
The problem
As soon as there are multiple companies in the database, then companies will share all primary key sequences between them.
For example, if there are two company rows, the customer_group table could look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
-------------------
But it should look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
This behavior should also be exhibited for customer and any other table in the tree that directly or indirectly references company.
Note that I will most likely make a second id column (named something like relative_id) for this purpose, keeping the unique id column intact, as this is really mostly for display purposes and how users will reference these data entities.
Now if this was just one level of hierarchy, it would be a relatively simple solution.
I could make a table (table_name, company_id, current_id) and a trigger procedure that fires before insert on any of the tables, incrementing the current id by 1 and setting the row's relative_id to that value.
It's trivial when the company_id is right there in the insert query.
But how about the tables that don't reference company directly?
Like the lowest level of the hierarchy in this example, workorder, which only references customer.
Is there a clean, reusable solution to climb the ladder all the way from 'customer_id' to ultimately retrieve the parenting company_id?
Going recursively up the hierarchy with SELECTs on each INSERT doesn't sound too appealing to me, performance wise.
I also do not like the idea of just adding a foreign key to company for each of these tables, the schema would get increasingly uglier with each additional table.
But these are the two solutions I can see, but I may not be looking in the right places.

The company shouldn't care what the primary key is if you're using generated keys. They're supposed to be meaningless; compared for equality and nothing else. I grumbled about this earlier, so I'm really glad to see you write:
Note that I will most likely make a second id column (named something
like relative_id) for this purpose, keeping the unique id column
intact, as this is really mostly for display purposes and how users
will reference these data entities.
You're doing it right.
Most of the time it doesn't matter what the ID is, so you can just give them whatever comes out of a sequence and not care about holes/gaps. If you're concerned about inter-company leakage (unlikely) you can obfuscate the IDs by using the sequence as an input to a pseudo-random generator. See the function Daniel Verité wrote in response to my question about this a few years ago, pseudo_encrypt.
There are often specific purposes for which you need perfectly sequential gapless IDs, like invoice numbers. For those you need to use a counter table and - yes - look up the company ID. Such ID generation is slow and has terrible concurrency anyway, so an additional SELECT with a JOIN or two on indexed keys won't hurt much. Don't go recursively up the schema with SELECTs though, just use a series of JOINs. For example, for an insert into workorder your key generation trigger on workorder would be something like the (untested):
CREATE OR REPLACE FUNCTION workorder_id_tgfn() RETURNS trigger AS $$
BEGIN
IF tg_op = 'INSERT' THEN
-- Get a new ID, locking the row so no other transaction can add a
-- workorder until this one commits or rolls back.
UPDATE workorder_ids
SET next_workorder_id = next_workorder_id + 1
WHERE company_id = (SELECT company_id
FROM customer
INNER JOIN customer_group ON (customer.customer_group_id = customer_group.id)
INNER JOIN company ON (customer_group.company_id = company.id)
WHERE customer.id = NEW.customer_id)
RETURNING next_workorder_id
INTO NEW.id;
END IF;
END;
$$ LANGUAGE 'plpgsql';
For the UPDATE ... RETURNING ... INTO syntax see Executing a Query with a Single-Row Result.
There can be gaps in normal sequences even if there's no multi-company problem. Observe:
CREATE TABLE demo (id serial primary key, blah text);
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
BEGIN;
INSERT INTO demo(blah) values ('bb');
ROLLBACK;
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
SELECT * FROM demo;
Result:
regress=# SELECT * FROM demo;
id | blah
----+------
1 | aa
3 | aa

"But it should look like this"
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
I think it should not and I think you want a many to many relationship. The customer_group table:
| id | name |
-------------
| 1 | n1 |
| 2 | n2 |
| 3 | n3 |
-------------
And then the customer_group_company table:
| group_id | company_id |
-------------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------------

Related

Select unique combination of values (attributes) based on user_id

I have a table that has user a user_id and a new record for each return reason for that user. As show here:
| user_id | return_reason |
|--------- |-------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
| 4 | changed mind |
What I would like to do is generate a foreign key for each combination of values that are applicable in a new table and apply that key to the user_id in a new table. Effectively creating a many to many relationship. The result would look like so:
Dimension Table ->
| reason_id | return_reason |
|----------- |--------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
Fact Table ->
| user_id | reason_id |
|--------- |----------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
My thought process is to iterate through the table with a cursor, but this seems like a standard problem and therefore has a more efficient way of doing this. Is there a specific name for this type of problem? I also thought about pivoting and unpivoting. But that didn't seem too clean either. Any help or reference to articles in how to process this is appreciated.
The problem concerns data normalization and relational integrity. Your concept doesn't really make sense - Dimension table shows two different reasons with same ID and Fact table loses a record. Conventional schema for this many-to-many relationship would be three tables like:
Users table (info about users and UserID is unique)
Reasons table (info about reasons and ReasonID is unique)
UserReasons junction table (associates users with reasons - your
existing table). Assuming user could associate with same reason
multiple times, probably also need ReturnDate and OrderID_FK fields
in UserReasons.
So, need to replace reason description in first table (UserReasons) with a ReasonID. Add a number long integer field ReasonID_FK in that table to hold ReasonID key.
To build Reasons table based on current data, use DISTINCT:
SELECT DISTINCT return_reason INTO Reasons FROM UserReasons
In new table, rename return_reason field to ReasonDescription and add an autonumber field ReasonID.
Now run UPDATE action to populate ReasonID_FK field in UserReasons.
UPDATE UserReasons INNER JOIN UserReasons.return_reason ON Reasons.ReasonDescription SET UserReasons.ReasonID_FK = Reasons.ReasonID
When all looks good, delete return_reason field.

Access text count in query design

I am new to Access and am trying to develop a query that will allow me to count the number of occurrences of one word in each field from a table with 15 fields.
The table simply stores test results for employees. There is one table that stores the employee identification - id, name, etc.
The second table has 15 fields - A1 through A15 with the words correct or incorrect in each field. I need the total number of incorrect occurrences for each field, not for the entire table.
Is there an answer through Query Design, or is code required?
The solution, whether Query Design, or code, would be greatly appreciated!
Firstly, one of the reasons that you are struggling to obtain the desired result for what should be a relatively straightforward request is because your data does not follow database normalisation rules, and consequently, you are working against the natural operation of a RDBMS when querying your data.
From your description, I assume that the fields A1 through A15 are answers to questions on a test.
By representing these as separate fields within your database, aside from the inherent difficulty in querying the resulting data (as you have discovered), if ever you wanted to add or remove a question to/from the test, you would be forced to restructure your entire database!
Instead, I would suggest structuring your table in the following way:
Results
+------------+------------+-----------+
| EmployeeID | QuestionID | Result |
+------------+------------+-----------+
| 1 | 1 | correct |
| 1 | 2 | incorrect |
| ... | ... | ... |
| 1 | 15 | correct |
| 2 | 1 | correct |
| 2 | 2 | correct |
| ... | ... | ... |
+------------+------------+-----------+
This table would be a junction table (a.k.a. linking / cross-reference table) in your database, supporting a many-to-many relationship between the tables Employees & Questions, which might look like the following:
Employees
+--------+-----------+-----------+------------+------------+-----+
| Emp_ID | Emp_FName | Emp_LName | Emp_DOB | Emp_Gender | ... |
+--------+-----------+-----------+------------+------------+-----+
| 1 | Joe | Bloggs | 01/01/1969 | M | ... |
| ... | ... | ... | ... | ... | ... |
+--------+-----------+-----------+------------+------------+-----+
Questions
+-------+------------------------------------------------------------+--------+
| Qu_ID | Qu_Desc | Qu_Ans |
+-------+------------------------------------------------------------+--------+
| 1 | What is the meaning of life, the universe, and everything? | 42 |
| ... | ... | ... |
+-------+------------------------------------------------------------+--------+
With this structure, if ever you wish to add or remove a question from the test, you can simply add or remove a record from the table without needing to restructure your database or rewrite any of the queries, forms, or reports which depends upon the existing structure.
Furthermore, since the result of an answer is likely to be a binary correct or incorrect, then this would be better (and far more efficiently) represented using a Boolean True/False data type, e.g.:
Results
+------------+------------+--------+
| EmployeeID | QuestionID | Result |
+------------+------------+--------+
| 1 | 1 | True |
| 1 | 2 | False |
| ... | ... | ... |
| 1 | 15 | True |
| 2 | 1 | True |
| 2 | 2 | True |
| ... | ... | ... |
+------------+------------+--------+
Not only does this consume less memory in your database, but this may be indexed far more efficiently (yielding faster queries), and removes all ambiguity and potential for error surrounding typos & case sensitivity.
With this new structure, if you wanted to see the number of correct answers for each employee, the query can be something as simple as:
select results.employeeid, count(*)
from results
where results.result = true
group by results.employeeid
Alternatively, if you wanted to view the number of employees answering each question correctly (for example, to understand which questions most employees got wrong), you might use something like:
select results.questionid, count(*)
from results
where results.result = true
group by results.questionid
The above are obviously very basic example queries, and you would likely want to join the Results table to an Employees table and a Questions table to obtain richer information about the results.
Contrast the above with your current database structure -
Per your original question:
The second table has 15 fields - A1 through A15 with the words correct or incorrect in each field. I need the total number of incorrect occurrences for each field, not for the entire table.
Assuming that you want to view the number of incorrect answers by employee, you are forced to use an incredibly messy query such as the following:
select
employeeid,
iif(A1='incorrect',1,0)+
iif(A2='incorrect',1,0)+
iif(A3='incorrect',1,0)+
iif(A4='incorrect',1,0)+
iif(A5='incorrect',1,0)+
iif(A6='incorrect',1,0)+
iif(A7='incorrect',1,0)+
iif(A8='incorrect',1,0)+
iif(A9='incorrect',1,0)+
iif(A10='incorrect',1,0)+
iif(A11='incorrect',1,0)+
iif(A12='incorrect',1,0)+
iif(A13='incorrect',1,0)+
iif(A14='incorrect',1,0)+
iif(A15='incorrect',1,0) as IncorrectAnswers
from
YourTable
Here, notice that the answer numbers are also hard-coded into the query, meaning that if you decide to add a new question or remove an existing question, not only would you need to restructure your entire database, but queries such as the above would also need to be rewritten.

Inventory Database Design

I got 2 tables User and Company.
These two have a inventory.
DataTables
Item
ItemId | Name
================
1 | Glass
4 | Wood
User
UId | Name
============
1 | Max
Company
CId | Name
==================
1 | EvilCorp
Inventory
RowId | UId | CId | ItemId | amount
=================================================
1 | 2 | Null | 4 | 10
2 | 23 | Null | 4 | 5
3 | Null | 1 | 1 | 7
4 | Null | 1 | 4 | 70
Let say I have 500 users and 300 companys and every one has 20 inventory slots, I will have 16000 null values in my Inventory table (6000 UId nulls + 10000 CId nulls).
I want a SQL query that will say this information.
Result
Owner | Item | Amount
===========================
MAX | Wood | 10
EvilCorp | Glass | 7
EvilCorp | Wood | 40
My problem is that my Inventory table is bad due to all the nulls that will appears against CId when the record is for a User, and vice versa.
Do you know how to create a good table, without huge or/and complex SQL queries?
You have Users, Companies and Owners. Users and Companies are Owners.
This is a common situation that can be put into tables various ways.
Every table holds the rows that make some statement true:
// "user [OId] has ..."
User(OId,...)
fk OId references Owner -- a user is an owner
// "company [OId] has ..."
Company(OId,....)
fk OId references Owner -- a company is an owner
// "Owner [OId] has ..."
Owner(OId,...)
// "[OId] owns [n] of item [itemId]"
Inventory(Oid,ItemId,amount)
fk OId references Owner
fk ItemId references Item
This comes from choosing tables that make plain statements about parts of your application. For variations read about subtables and supertables for subtypes and supertypes. See this answer or this one.

Display another field in the referenced table for multiple columns with performance issues in mind

I have a table of edge like this:
-------------------------------
| id | arg1 | relation | arg2 |
-------------------------------
| 1 | 1 | 3 | 4 |
-------------------------------
| 2 | 2 | 6 | 5 |
-------------------------------
where arg1, relation and arg2 reference to the ids of objects in another object table:
--------------------
| id | object_name |
--------------------
| 1 | book |
--------------------
| 2 | pen |
--------------------
| 3 | on |
--------------------
| 4 | table |
--------------------
| 5 | bag |
--------------------
| 6 | in |
--------------------
What I want to do is that, considering performance issues (a very big table more than 50 million of entries) display the object_name for each edge entry rather than id such as:
---------------------------
| arg1 | relation | arg2 |
---------------------------
| book | on | table |
---------------------------
| pen | in | bag |
---------------------------
What is the best select query to do this? Also, I am open to suggestions for optimizing the query - adding more index on the tables etc...
EDIT: Based on the comments below:
1) #Craig Ringer: PostgreSQL version: 8.4.13 and only index is id for both tables.
2) #andrefsp: edge is almost x2 times bigger than object.
If you can change the structure of the database, you may try to denormalize this part of the database and make table edge with fields id, arg1_name, relation_name, arg2_name. And keep table object without changes to take names for the edge table when you insert or update it.
It is not good. Your data will be duplicates (size of the database will be greater) and it may be difficult to insert or update tables.
But it should be fast to select (no JOINs):
SELECT arg1_name, relation_name, arg2_name
FROM edge;
It won't get cheaper than this:
SELECT o1.object_name, r1.object_name, o2.object_name
FROM edge e
JOIN object o1 ON o1.id = e.arg1
JOIN object r ON r.id = e.relation
JOIN object o2 ON o2.id = e.arg2;
And you don't need more indexes. The one on object.id is the only one needed for this query.
But I seriously doubt that you want to retrieve 50 millions of rows at once, and in no particular order. You still didn't give the full picture.

Retrieve comma delimited data from a field

I've created a form in PHP that collects basic information. I have a list box that allows multiple items selected (i.e. Housing, rent, food, water). If multiple items are selected they are stored in a field called Needs separated by a comma.
I have created a report ordered by the persons needs. The people who only have one need are sorted correctly, but the people who have multiple are sorted exactly as the string passed to the database (i.e. housing, rent, food, water) --> which is not what I want.
Is there a way to separate the multiple values in this field using SQL to count each need instance/occurrence as 1 so that there are no comma delimitations shown in the results?
Your database is not in the first normal form. A non-normalized database will be very problematic to use and to query, as you are actually experiencing.
In general, you should be using at least the following structure. It can still be normalized further, but I hope this gets you going in the right direction:
CREATE TABLE users (
user_id int,
name varchar(100)
);
CREATE TABLE users_needs (
need varchar(100),
user_id int
);
Then you should store the data as follows:
-- TABLE: users
+---------+-------+
| user_id | name |
+---------+-------+
| 1 | joe |
| 2 | peter |
| 3 | steve |
| 4 | clint |
+---------+-------+
-- TABLE: users_needs
+---------+----------+
| need | user_id |
+---------+----------+
| housing | 1 |
| water | 1 |
| food | 1 |
| housing | 2 |
| rent | 2 |
| water | 2 |
| housing | 3 |
+---------+----------+
Note how the users_needs table is defining the relationship between one user and one or many needs (or none at all, as for user number 4.)
To normalise your database further, you should also use another table called needs, and as follows:
-- TABLE: needs
+---------+---------+
| need_id | name |
+---------+---------+
| 1 | housing |
| 2 | water |
| 3 | food |
| 4 | rent |
+---------+---------+
Then the users_needs table should just refer to a candidate key of the needs table instead of repeating the text.
-- TABLE: users_needs (instead of the previous one)
+---------+----------+
| need_id | user_id |
+---------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 1 | 2 |
| 4 | 2 |
| 2 | 2 |
| 1 | 3 |
+---------+----------+
You may also be interested in checking out the following Wikipedia article for further reading about repeating values inside columns:
Wikipedia: First normal form - Repeating groups within columns
UPDATE:
To fully answer your question, if you follow the above guidelines, sorting, counting and aggregating the data should then become straight-forward.
To sort the result-set by needs, you would be able to do the following:
SELECT users.name, needs.name
FROM users
INNER JOIN needs ON (needs.user_id = users.user_id)
ORDER BY needs.name;
You would also be able to count how many needs each user has selected, for example:
SELECT users.name, COUNT(needs.need) as number_of_needs
FROM users
LEFT JOIN needs ON (needs.user_id = users.user_id)
GROUP BY users.user_id, users.name
ORDER BY number_of_needs;
I'm a little confused by the goal. Is this a UI problem or are you just having trouble determining who has multiple needs?
The number of needs is the difference:
Len([Needs]) - Len(Replace([Needs],',','')) + 1
Can you provide more information about the Sort you're trying to accomplish?
UPDATE:
I think these Oracle-based posts may have what you're looking for: post and post. The only difference is that you would probably be better off using the method I list above to find the number of comma-delimited pieces rather than doing the translate(...) that the author suggests. Hope this helps - it's Oracle-based, but I don't see .