SQL Schema for multiple many-to-many relationships - sql

Consider I have 4 tables
persons
companies
groups
and
bills
Now there is a many-to-many relationship between bills/persons and bills/companies and bills/groups.
I see 4 possibilities for a sql schema for this:
variant 1 (multiple relationship tables)
persons_bills
person_id
bill_id
companies_bills
company_id
bill_id
groups_bills
group_id
bill_id
variant 2 (one relationship table with one id set and all others null)
bills_relations
person_id
company_id
group_id
bill_id
with a check that only person_id OR company_id OR group_id can be set and all other twos are null.
variant 3 (one relationship table with string reference to the other table)
bills_relations
bill_id
row_id
row_table
with row_table can have the string values 'person', 'company', 'group'.
variant 4 (add a supertype table)
persons
id
debtor_id
companies
id
deptor_id
groups
id
deptor_id
deptors
id
bills_deptors
bill_id
deptor_id
Can you recommend one variant?

I think that either variant 1 (multiple relationship tables) or variant 4 (add a supertype table) are the most feasible choices here.
Variant 2 is a much less efficient way to store the data since it requires the storage of 3 extra NULLs for each relationship.
Variant 3 will get you into a lot of trouble when trying to JOIN between bills and one of the other tables, since you won't be able to do it directly. You'll have to first select the table name from the string reference, and then inject it into a second query. Any kind of SQL injections like this open up the database to a SQL injection attack, so they are best avoided if possible.
Variant 1 is probably the best out of 1 and 4 in my opinion, since it will require one less JOIN in your queries and hence make them a little simpler. If all the tables are indexed correctly though, I don't think there should be much difference in performance (or space efficiency) between these two.

Related

Problem with running Efficient search on a DB table based on multiple permutations

I have a table with StudentId to subject mapping and below are the two possible schema's
DB: RDBMS
schema1:
StudentId
Subject
1
a
1
b
2
a
2
c
3
b
4
b
4
c
schema2:
StudentId
Subjects
1
a, b
2
a, c
3
b
4
b, c
Constraints:
The maximum number of subjects that a student can subscribe to can be 10. The total number of subjects available are ~50.
Use Case:
I would like to filter students subscribed to a list of subjects, only those requested subjects and nothing outside.
Eg: if I look for students subscribed to subjects a,b then the result should only include students subscribed to a,b or a or b. not a,c or b,c
If I run this search in the sample schema provided above then
Expected result: StudentId 1 and 3. Note that 2 and 4 are not part of the result because of c
Solution1:
If I go with Schema1 then I have to query table "where subject in ('a' , 'b')" which will return all studentIds and I will have have additional logic after fetching the result and discard studentId 2 and 4
And the same with schema2 as well where i run a LIKE "%a%" or LIKE "%b%" which returns all the result.
I can also create a list of possible combinations to query like 'a' or 'b' or 'a,b' but this combinations set will grow huge if requesting for multiple subjects.
Problem:
Always returns a huge result set which will need additional processing.
Solution2:
select distinct(StudentId) from schema1 where subject NOT IN ('c');
select StudentId from schema2 where subjects NOT LIKE "%c%";
Here I can directly fetch only the studentIds as expected but this solution has the below problem
Will involve me to know all available subjects
Run a query filter with all subjects except for the requested subjects
The subject list can grow over time and the query becomes inefficient
Solution3:
select distinct(StudentId)
from schema1 where StudentId NOT IN (
select distinct(StudentId)
from schema1 where subject NOT IN ('a', 'b')
);
Is there an efficient way to run this query? Or a better schema design that would help with the use case? Is there other database technologies that are better at solving such problems efficiently?
There is no consideration at all. A table represents an entity. Your data has two entities: students and subjects.
These entities are connected by an n-m relationship: any student can have multiple subjects; any subject can have multiple students.
So, you want three tables:
students
subjects
studentSubjects
In relational databases, you do not want to store multiple items in a single column, particularly in a string column. Of your two options, the first is basically correct, but I would recommend a subjects table and numeric ids for studentIds and subjectIds.

SQLite JOIN two tables with duplicated keys

I need to join two tables on two different fields. I have table 1 like this:
key productid customer
1 100 jhon
2 109 paul
3 100 john
And table 2 has same fields but aditional data I must relate to first table
key productid customer status date ...
1 109 phil ok 04/01
2 109 paul nok 04/03
3 100 jhon nok 04/06
4 100 jhon ok 04/06
Both "key" fields are autoincrement. Problem is that my relationship fields are repeated several times across result and I need to generate a one-to-one relationship, in such manner that one row from table 2 must be related ONLY ONCE with a row on table 1.
I did a left join on (customer=customer and productid=productid) but relationship came out duplicated, a row from tablet 2 was related many times to rows of table one.
To clarify things...
I have to cross check both tables, table 1 is loaded from an XLS report, table 2 is data from a database that reflects customer transactions with many status data. I have to check if a row from XLS exists in database and then load additional status data. I must produce a report when rows from XLS has no correspondent data on database.
How can accomplish this JOIN, is this possible with only SQL?
You can accomplish this in MS SQL using the sql below. Not sure if SQLite supports this.
select a.*, c.*
from table2 a, ( select min(key) key, productid, customer
from table1
group by productid, customer
) b,
table1 c
where a.productid = b.productid
and a.customer = b.customer
and b.key = c.key
One way to understand this would be to figure out what each table represents exactly. Both tables seem to represent the same thing, with a row representing what you might call a purchase. Why are there two separate tables, then? Perhaps the second table goes into more depth about each purchase? Like jhon bought product 100, and it was 'nok' first and then 'ok'? Is so, then the key (what makes the table unique) for the second table would be all three fields.
You still join on only the two fields that match, but you can't expect uniqueness if there are two rows with the same unique keys.
It helps sometimes to create additional indexes on a table, to see what is truly unique.

SQL Merging Associated Records

Let's say we have a database with a table that has many other associated tables. If you diagrammed the database, this would be the table at the center with many foreign key relationships spiraling out of it.
To make it more concrete, let's say the two records in this central table are Initech and Contoso. Initech and Contoso are both associated with many other records in associated tables like Employees, AccountingTransactions, etc. Let's say the two merged (Initech bought Contoso) and from a data standpoint, it really is as simple as merging all the records. What's the easiest way to take all of Contoso's related records, make them point to Initech and then delete Contoso?
UPDATE with CASCADE comes tantalizingly close, but it obviously can't work without turning off constraints and then turning them back on (yuck).
Is there a nice generic way to do this without hunting down every single linked table and migrating them one by one? This has to be a common requirement. It's come up in two places in this project and can be summed up with: Entity A needs to control everything Entity B current controls. How can I make it happen?
Before Merge:
Companies
ID Name
1 Contoso
2 Initech
Employees
ID Name CompanyId
1 Bob 1
2 Ted 2
After Merge:
Companies
ID Name
2 Initech
Employees
ID Name CompanyId
1 Bob 2
2 Ted 2
All my attempts at searching only turned up questions about merging separate databases... so sorry if this has been asked before.
This query is likely vendor-dependent, but in MySQL:
UPDATE Employees e, Cars c, OtherEntity o
SET e.CompanyId = 2, c.CompanyId = 2, o.CompanyID = 2
WHERE e.CompanyID = 1 OR c.CompanyId = 1 OR o.CompanyId = 1;
Succinctly, no; there isn't a generic way to do it.
Consider your sample database with tables Companies, Employees, Departments, and AccountingTransactions.
You need to delete one of the company records (because after the merger, you will only record the current state of affairs).
You need to alter the employee records to change the employing company. However, it is quite possible that there is an employee number N in both companies, and one of those (presumably Contoso's) will have to be assigned a new employee number.
You probably face the problem that department 1 in Conotoso's data is Engineering, but in Initech's is Finance. So, you need to worry about how you are going to map the department numbers between the two companies, and then you face the problem of assigning Contoso's employees to Initech's departments.
For the historical accounting transactions, you probably have to keep Contoso's historical accounting records in Contoso's name, while some (of the most recent) transactions will need to be migrated to Initech's name. So maybe you won't be deleting the Contoso record from the table of companies after all, but you won't be able to use it to identify any new records.
These are just a small sampling of the reasons why such mappings cannot readily be automated.
No, there's no simple generic way of merging rows and cascading those changes throughout your system. You can script it all - which may be the best way, depending on your scenario - or devise a workaround.
One workaround might be to implement a parenting pattern on your central table (or abstract it to another table). You would then end up with something like
Companies
ID ParentID Name
1 2 Contoso
2 null Initech
or
Companies
ID ParentID Name
1 3 Contoso
2 3 Initech
3 null MegaInitech
and all your queries that join onto this central Companies table now check ID and ParentID;
SELECT *
FROM Employees
WHERE CompanyId IN (SELECT ID FROM Companies WHERE ID = #id OR ParentID = #ID)
Abstract this away to a view or function
CREATE FUNCTION fn_IsMemberOf
(
#companyId INT,
#parentId INT
)
RETURNS BIT
AS
BEGIN
DECLARE #result BIT = 0
SELECT #result = 1 FROM Companies
WHERE ID = #companyId
AND COALESCE(ParentID, ID) = #parentID
RETURN #result
END
SELECT *
FROM Employees
WHERE fn_IsMemberOf(CompanyId, 1) = 1
(haven't tested this but you get the idea)

Storing & Querying Heirarchical Data with Multiple Parent Nodes

I've been doing quite a bit of searching, but haven't been able to find many resources on the topic. My goal is to store scheduling data like you would find in a Gantt chart. So one example of storing the data might be:
Task Id | Name | Duration
1 Task A 1
2 Task B 3
3 Task C 2
Task Id | Predecessors
1 Null
2 Null
3 1
3 2
Which would have Task C waiting for both Task A and Task B to complete.
So my question is: What is the best way to store this kind of data and efficiently query it? Any good resources for this kind of thing? There is a ton of information about tree structures, but once you add in multiple parents it becomes hard to find info. By the way, I'm working with SQL Server and .NET for this task.
Your problem is related to the concept of relationship cardinality. All relationships have some cardinality, which expresses the potential number of instances on each side of the relationship that are members of it, or can participate in a single instance of the relationship. As an example, for people, (for most living things, I guess, with rare exceptions), the Parent-Child relationship has a cardinality of 2 to zero or many, meaning it takes two parents on the parent side, and there can be zero or many children (perhaps it should be 2 to 1 or many)
In database design, generally, anything that has a 1(one), (or a zero or one), on one side can be easily represented with just two tables, one for each entity, (sometimes only one table is needed see note**) and a foreign key column in the table representing the "many" side, that points to the other table holding the entity on the "one" side.
In your case you have a many to many relationship. (A Task can have multiple predecessors, and each predecessors can certainly be the predecessor for multiple tasks) In this case a third table is needed, where each row, effectively, represents an association between 2 tasks, representing that one is the predecessor to the other. Generally, This table is designed to contain only all the columns of the primary keys of the two parent tables, and it's own primary key is a composite of all the columns in both parent Primary keys. In your case it simply has two columns, the taskId, and the PredecessorTaskId, and this pair of Ids should be unique within the table so together they form the composite PK.
When querying, to avoid double counting data columns in the parent tables when there are multiple joins, simply base the query on the parent table... e.g., to find the duration of the longest parent,
Assuming your association table is named TaskPredecessor
Select TaskId, Max(P.Duration)
From Task T Join Task P
On P.TaskId In (Select PredecessorId
From TaskPredecessor
Where TaskId = T.TaskId)
** NOTE. In cases where both entities in the relationship are of the same entity type, they can both be in the same table. The canonical (luv that word) example is an employee table with the many to one relationship of Worker to Supervisor... Since the supervisor is also an employee, both workers and supervisors can be in the same [Employee] table, and the realtionship can gbe modeled with a Foreign Key (called say SupervisorId) that points to another row in the same table and contains the Id of the employee record for that employee's supervisor.
Use adjacency list model:
chain
task_id predecessor
3 1
3 2
and this query to find all predecessors of the given task:
WITH q AS
(
SELECT predecessor
FROM chain
WHERE task_id = 3
UNION ALL
SELECT c.predecessor
FROM q
JOIN chain c
ON c.task_id = q.predecessor
)
SELECT *
FROM q
To get the duration of the longest parent for each task:
WITH q AS
(
SELECT task_id, duration
FROM tasks
UNION ALL
SELECT t.task_id, t.duration
FROM q
JOIN chain с
ON c.task_id = q.task_id
JOIN tasks t
ON t.task_id = c.predecessor
)
SELECT task_id, MAX(duration)
FROM q
Check "Hierarchical Weighted total" pattern in "SQL design patterns" book, or "Bill Of Materials" section in "Trees and Hierarchies in SQL".
In a word, graphs feature double aggregation. You do one kind of aggregation along the nodes in each path, and another one across alternative paths. For example, find a minimal distance between the two nodes is minimum over summation. Hierarchical weighted total query (aka Bill Of Materials) is multiplication of the quantities along each path, and summation along each alternative path:
with TCAssembly as (
select Part, SubPart, Quantity AS factoredQuantity
from AssemblyEdges
where Part = ‘Bicycle’
union all
select te.Part, e.SubPart, e.Quantity * te.factoredQuantity
from TCAssembly te, AssemblyEdges e
where te.SubPart = e.Part
) select SubPart, sum(Quantity) from TCAssembly
group by SubPart

Dynamic Tables?

I have a database that has different grades per course (i.e. three homeworks for Course 1, two homeworks for Course 2, ... ,Course N with M homeworks). How should I handle this as far as database design goes?
CourseID HW1 HW2 HW3
1 100 99 100
2 100 75 NULL
EDIT
I guess I need to rephrase my question. As of right now, I have two tables, Course and Homework. Homework points to Course through a foreign key. My question is how do I know how many homeworks will be available for each class?
No, this is not a good design. It's an antipattern that I called Metadata Tribbles. You have to keep adding new columns for each homework, and they propagate out of control.
It's an example of repeating groups, which violates the First Normal Form of relational database design.
Instead, you should create one table for Courses, and another table for Homeworks. Each row in Homeworks references a parent row in Courses.
My question is how do I know how many homeworks will be available for each class?
You'd add rows for each homework, then you can count them as follows:
SELECT CourseId, COUNT(*) AS Num_HW_Per_Course
FROM Homeworks
GROUP BY CourseId
Of course this only counts the homeworks after you have populated the table with rows. So you (or the course designers) need to do that.
Decompose the table into three different tables. One holds the courses, the second holds the homeworks, and the third connects them and stores the result.
Course:
CourseID CourseName
1 Foo
Homework:
HomeworkID HomeworkName HomeworkDescription
HW1 Bar ...
Result:
CourseID HomeworkID Result
1 HW1 100