Storing & Querying Hierarchical Data with Multiple Parent Nodes - sql

I've been doing quite a bit of searching, but haven't been able to find many resources on the topic. My goal is to store scheduling data like you would find in a Gantt chart. So one example of storing the data might be:
Task Id | Name | Duration
1 Task A 1
2 Task B 3
3 Task C 2
Task Id | Predecessors
1 Null
2 Null
3 1
3 2
Which would have Task C waiting for both Task A and Task B to complete.
So my question is: What is the best way to store this kind of data and efficiently query it? Any good resources for this kind of thing? There is a ton of information about tree structures, but once you add in multiple parents it becomes hard to find info. By the way, I'm working with SQL Server and .NET for this task.

Your problem is related to the concept of relationship cardinality. Every relationship has a cardinality, which expresses how many instances on each side can participate in a single instance of the relationship. For people, for example (and for most living things, I guess, with rare exceptions), the Parent-Child relationship has a cardinality of 2 to zero-or-many: it takes two parents on the parent side, and there can be zero or many children (perhaps it should be 2 to 1-or-many).
In database design, anything that has a one (or a zero-or-one) on one side can generally be represented with just two tables, one for each entity (sometimes only one table is needed; see note **), and a foreign key column in the table on the "many" side that points to the table holding the entity on the "one" side.
In your case you have a many-to-many relationship. (A task can have multiple predecessors, and each predecessor can certainly be the predecessor of multiple tasks.) In this case a third table is needed, where each row represents an association between two tasks, recording that one is the predecessor of the other. Generally, this table contains only the primary key columns of the two parent tables, and its own primary key is a composite of all of those columns. In your case it simply has two columns, TaskId and PredecessorTaskId, and this pair of Ids should be unique within the table, so together they form the composite PK.
When querying, to avoid double counting data columns in the parent tables when there are multiple joins, simply base the query on the parent table. For example, to find the duration of the longest predecessor of each task,
assuming your association table is named TaskPredecessor:
Select T.TaskId, Max(P.Duration)
From Task T Join Task P
On P.TaskId In (Select PredecessorTaskId
From TaskPredecessor
Where TaskId = T.TaskId)
Group By T.TaskId
** NOTE. In cases where both entities in the relationship are of the same entity type, they can both be in the same table. The canonical (luv that word) example is an employee table with the many-to-one relationship of Worker to Supervisor. Since the supervisor is also an employee, both workers and supervisors can be in the same [Employee] table, and the relationship can be modeled with a foreign key (called, say, SupervisorId) that points to another row in the same table and contains the Id of the employee record for that employee's supervisor.
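As a rough sketch of the association-table design described above, the following uses Python's built-in sqlite3 module (an assumption made only so the example is runnable; the question targets SQL Server, where the DDL would differ slightly). The table and column names follow the answer; the query is written with explicit joins through TaskPredecessor rather than the IN subquery, which should return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Task (
        TaskId   INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        Duration INTEGER NOT NULL
    );
    -- Association table: one row per (task, predecessor) pair.
    -- The composite primary key keeps each pair unique.
    CREATE TABLE TaskPredecessor (
        TaskId            INTEGER NOT NULL REFERENCES Task(TaskId),
        PredecessorTaskId INTEGER NOT NULL REFERENCES Task(TaskId),
        PRIMARY KEY (TaskId, PredecessorTaskId)
    );
    INSERT INTO Task VALUES (1, 'Task A', 1), (2, 'Task B', 3), (3, 'Task C', 2);
    INSERT INTO TaskPredecessor VALUES (3, 1), (3, 2);
""")

# Duration of the longest direct predecessor of each task.
# Tasks without predecessors do not appear in the result.
longest_predecessor = conn.execute("""
    SELECT T.TaskId, MAX(P.Duration)
    FROM Task T
    JOIN TaskPredecessor TP ON TP.TaskId = T.TaskId
    JOIN Task P ON P.TaskId = TP.PredecessorTaskId
    GROUP BY T.TaskId
""").fetchall()

print(longest_predecessor)  # [(3, 3)]
```

Task C waits on Task A (duration 1) and Task B (duration 3), so its longest predecessor has duration 3.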

Use adjacency list model:
chain
task_id predecessor
3 1
3 2
and this query to find all predecessors of the given task:
WITH q AS
(
SELECT predecessor
FROM chain
WHERE task_id = 3
UNION ALL
SELECT c.predecessor
FROM q
JOIN chain c
ON c.task_id = q.predecessor
)
SELECT *
FROM q
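To see the recursive query walk more than one level, here is a small sketch run through Python's sqlite3 module. Two things are assumptions of this sketch and not part of the answer: the extra row (4, 3), and the use of SQLite, which spells the CTE WITH RECURSIVE. It also uses UNION instead of UNION ALL so that a predecessor reachable by two different paths is listed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chain (task_id INTEGER, predecessor INTEGER);
    -- Hypothetical data: task 4 waits for 3, which waits for 1 and 2.
    INSERT INTO chain VALUES (3, 1), (3, 2), (4, 3);
""")

def all_predecessors(task_id):
    """Return every direct and indirect predecessor of task_id, sorted."""
    rows = conn.execute("""
        WITH RECURSIVE q(predecessor) AS (
            SELECT predecessor FROM chain WHERE task_id = ?
            UNION
            SELECT c.predecessor
            FROM q JOIN chain c ON c.task_id = q.predecessor
        )
        SELECT predecessor FROM q ORDER BY predecessor
    """, (task_id,)).fetchall()
    return [r[0] for r in rows]

print(all_predecessors(4))  # [1, 2, 3]
print(all_predecessors(3))  # [1, 2]
```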
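To get the duration of the longest predecessor (direct or indirect) for each task:
WITH q AS
(
-- direct predecessors of every task
SELECT task_id, predecessor
FROM chain
UNION ALL
-- predecessors of predecessors, still attributed to the original task
SELECT q.task_id, c.predecessor
FROM q
JOIN chain c
ON c.task_id = q.predecessor
)
SELECT q.task_id, MAX(t.duration)
FROM q
JOIN tasks t
ON t.task_id = q.predecessor
GROUP BY q.task_id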

Check "Hierarchical Weighted total" pattern in "SQL design patterns" book, or "Bill Of Materials" section in "Trees and Hierarchies in SQL".
In a word, graphs feature double aggregation. You do one kind of aggregation along the nodes in each path, and another one across alternative paths. For example, finding the minimal distance between two nodes is a minimum over summations. The hierarchical weighted total query (aka Bill of Materials) is a multiplication of the quantities along each path, and a summation across the alternative paths:
with TCAssembly as (
select Part, SubPart, Quantity AS factoredQuantity
from AssemblyEdges
where Part = 'Bicycle'
union all
select te.Part, e.SubPart, e.Quantity * te.factoredQuantity
from TCAssembly te, AssemblyEdges e
where te.SubPart = e.Part
) select SubPart, sum(factoredQuantity) from TCAssembly
group by SubPart
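A small worked run of the weighted-total query may make the "multiply along a path, sum across paths" idea concrete. The sketch below uses Python's sqlite3 module (an assumption; SQLite needs WITH RECURSIVE), and the parts and quantities are made up for illustration. Only the AssemblyEdges column names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AssemblyEdges (Part TEXT, SubPart TEXT, Quantity INTEGER);
    -- Hypothetical bill of materials; Bolt is reachable via two paths.
    INSERT INTO AssemblyEdges VALUES
        ('Bicycle', 'Wheel', 2),
        ('Bicycle', 'Frame', 1),
        ('Wheel',   'Spoke', 32),
        ('Wheel',   'Bolt',  4),
        ('Frame',   'Bolt',  6);
""")

totals = dict(conn.execute("""
    WITH RECURSIVE TCAssembly(Part, SubPart, factoredQuantity) AS (
        SELECT Part, SubPart, Quantity
        FROM AssemblyEdges
        WHERE Part = 'Bicycle'
        UNION ALL
        -- multiply quantities along each path
        SELECT te.Part, e.SubPart, e.Quantity * te.factoredQuantity
        FROM TCAssembly te JOIN AssemblyEdges e ON te.SubPart = e.Part
    )
    -- sum across alternative paths
    SELECT SubPart, SUM(factoredQuantity)
    FROM TCAssembly
    GROUP BY SubPart
""").fetchall())

print(totals["Bolt"])   # 14 = 2 wheels * 4 bolts + 1 frame * 6 bolts
print(totals["Spoke"])  # 64 = 2 wheels * 32 spokes
```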

Related

Query to identify the parent/child relationship between two big tables

I have two tables. The first one contains laboratory result header records, one for each order. It has about 10 million rows in it that contain one of about 6,000 unique ProcedureIDs...
OrderID
ResultID
ProcedureID
ProcedureName
OrderDate
ResultDate
PatientID
ProviderID
The second table contains the detailed result record(s) for each order in the first table. It has about 80 million rows and contains about 28,000 child components that are associated with the 6,000 procedure IDs from the first table.
ResultComponentID
ResultID (foreign key to first table)
ComponentID
ComponentName
ResultValueType
ResultValue
ResultUnits
ResultingLab
I have a subset (n=135) procedure IDs for which I need a list of associated child component IDs. Here is a simple example...
Table 1
1000|1|CBC|Complete Blood Count|8/1/2019 08:00:00|8/2/2019 09:27:00|9999|8888
1001|2|CA|Calcium|8/1/2019 08:01:00|8/2/2019 09:28:00|9999|8888
Table 2
2543|1|RBC|Red Blood Cell Count|NM|60|Million/uL|OurLab
2544|1|PLT|Platelet Count|NM|60|Thou/cmm|OurLab
2545|2|RBC|Red Blood Cell Count|NM|60|Million/uL|OurLab
2546|1|CA|Calcium|NM|40|g/dl|OurLab
In this example, if CBC was in my subset and CA wasn't, I would expect two rows back...
CBC|Complete Blood Count|RBC|Red Blood Cell Count
CBC|Complete Blood Count|PLT|Platelet Count
Even if I had two million CBCs in the DB, I only need have one set of CBC parent/child rows.
If I were using a scripting tool, I would use a for each loop to iterate through the subset and grab the top 1 of each ProcedureID and use it to get the associated component children.
If I really wanted to go crazy with this, I would not assume that CBC only had two components, as some labs might send us two and some might send us seven.
Any advice on how to get the list of parent/child associations?
For the simple query, sometimes there is no way around just writing out all 135 ids if you can't find a neat way to get that subset out of a query or store it in a temp table.
For the uniqueness requirement, just add a 'group by'
Select t1.ProcedureId, t2.ComponentId
from Table1 t1
join Table2 t2 on t2.ResultId = t1.ResultId
where t1.ProcedureId in (
'CBC',
'etc' -- 135 times...
)
group by t1.ProcedureId, t2.ComponentId
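If the subset can be loaded into a temp table, the long IN list can be replaced by a join. The following is a minimal sketch using Python's sqlite3 module; the WantedProcedures table name and the sample rows are hypothetical, and the tables are cut down to the columns the query needs. SELECT DISTINCT does the same de-duplication as the GROUP BY above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ResultID INTEGER PRIMARY KEY,
                         ProcedureID TEXT, ProcedureName TEXT);
    CREATE TABLE Table2 (ResultComponentID INTEGER PRIMARY KEY, ResultID INTEGER,
                         ComponentID TEXT, ComponentName TEXT);
    -- Hypothetical sample rows: two CBC orders and one CA order.
    INSERT INTO Table1 VALUES
        (1, 'CBC', 'Complete Blood Count'),
        (2, 'CA',  'Calcium'),
        (3, 'CBC', 'Complete Blood Count');
    INSERT INTO Table2 VALUES
        (10, 1, 'RBC', 'Red Blood Cell Count'),
        (11, 1, 'PLT', 'Platelet Count'),
        (12, 2, 'CA',  'Calcium'),
        (13, 3, 'RBC', 'Red Blood Cell Count');
    -- The subset of interest, kept in a temp table instead of a long IN list.
    CREATE TEMP TABLE WantedProcedures (ProcedureID TEXT PRIMARY KEY);
    INSERT INTO WantedProcedures VALUES ('CBC');
""")

pairs = conn.execute("""
    SELECT DISTINCT t1.ProcedureID, t1.ProcedureName,
                    t2.ComponentID, t2.ComponentName
    FROM WantedProcedures w
    JOIN Table1 t1 ON t1.ProcedureID = w.ProcedureID
    JOIN Table2 t2 ON t2.ResultID = t1.ResultID
    ORDER BY t2.ComponentID
""").fetchall()

for row in pairs:
    print("|".join(row))
# CBC|Complete Blood Count|PLT|Platelet Count
# CBC|Complete Blood Count|RBC|Red Blood Cell Count
```

Because every matching order contributes its components before duplicates are removed, a lab that sends seven components for CBC and another that sends two would both be covered.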

Purpose of Self-Joins

I am learning to program with SQL and have just been introduced to self-joins. I understand how these work, but I don't understand what their purpose is besides a very specific usage, joining an employee table to itself to neatly display employees and their respective managers.
This usage can be demonstrated with the following table:
EmployeeID | Name | ManagerID
------ | ------------- | ------
1 | Sam | 10
2 | Harry | 4
4 | Manager | NULL
10 | AnotherManager| NULL
And the following query:
select worker.employeeID, worker.name, worker.managerID, manager.name
from employee worker join employee manager
on (worker.managerID = manager.employeeID);
Which would return:
1 | Sam | 10 | AnotherManager
2 | Harry | 4 | Manager
Besides this, are there any other circumstances where a self-join would be useful? I can't figure out a scenario where a self-join would need to be performed.
Your example is a good one. Self-joins are useful whenever a table contains a foreign key into itself. An employee has a manager, and the manager is... another employee. So a self-join makes sense there.
Many hierarchies and relationship trees are a good fit for this. For example, you might have a parent organization divided into regions, groups, teams, and offices. Each of those could be stored as an "organization", with a parent id as a column.
Or maybe your business has a referral program, and you want to record which customer referred someone. They are both 'customers', in the same table, but one has a FK link to another one.
Hierarchies that are not a good fit for this would be ones where an entity might have more than one "parent" link. For example, suppose you had Facebook-style data recording every user and friendship links to other users. That could be made to fit in this model, but then you'd need a new "user" row for every friend that a user had, and every row would be a duplicate except for the "relationshipUserID" column or whatever you called it.
In many-to-many relationships, you would probably have a separate "relationship" table, with a "from" and "to" column, and perhaps a column indicating the relationship type.
I found self joins most useful in situations like this:
Get all employees that work for the same manager as Sam. (This does not have to be hierarchical, this can also be: Get all employees that work at the same location as Sam)
select e2.employeeID, e2.name
from employee e1 join employee e2
on (e1.managerID = e2.managerID)
where e1.name = 'Sam'
Also useful to find duplicates in a table, but this can be very inefficient.
There are several great examples of using self-joins here. The one I often use relates to "timetables". I work with timetables in education, but they are relevant in other cases too.
I use self-joins to work out whether two items clash with one another, e.g. a student is scheduled for two lessons which happen at the same time, or a room is double booked. For example:
CREATE TABLE StudentEvents(
StudentId int,
EventId int,
EventDate date,
StartTime time,
EndTime time
)
SELECT
se1.StudentId,
se1.EventDate,
se1.EventId Event1Id,
se1.StartTime as Event1Start,
se1.EndTime as Event1End,
se2.StartTime as Event2Start,
se2.EndTime as Event2End
FROM
StudentEvents se1
JOIN StudentEvents se2 ON
se1.StudentId = se2.StudentId
AND se1.EventDate = se2.EventDate
AND se1.EventId > se2.EventId
--The above line prevents (a) an event being seen as clashing with itself
--and (b) the same pair of events being returned twice, once as (A,B) and once as (B,A)
WHERE
se1.StartTime < se2.EndTime AND
se1.EndTime > se2.StartTime
Similar logic can be used to find other things in "timetable data", such as a pair of trains it is possible to take from A via B to C.
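Here is a small runnable sketch of the clash query, using Python's sqlite3 module. The sample rows are hypothetical, and times are stored as zero-padded 'HH:MM' text so that plain string comparison orders them correctly (an assumption of this sketch; a real schema would use proper date and time types).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StudentEvents (
        StudentId INTEGER, EventId INTEGER,
        EventDate TEXT, StartTime TEXT, EndTime TEXT
    );
    INSERT INTO StudentEvents VALUES
        (1, 101, '2024-01-15', '09:00', '10:00'),
        (1, 102, '2024-01-15', '09:30', '10:30'),  -- overlaps 101
        (1, 103, '2024-01-15', '10:30', '11:30'),  -- starts exactly when 102 ends
        (2, 101, '2024-01-15', '09:00', '10:00');  -- other student, no clash
""")

clashes = conn.execute("""
    SELECT se1.StudentId, se2.EventId AS Event1Id, se1.EventId AS Event2Id
    FROM StudentEvents se1
    JOIN StudentEvents se2
      ON se1.StudentId = se2.StudentId
     AND se1.EventDate = se2.EventDate
     AND se1.EventId > se2.EventId      -- each pair once, never an event with itself
    WHERE se1.StartTime < se2.EndTime   -- standard interval-overlap test
      AND se1.EndTime > se2.StartTime
""").fetchall()

print(clashes)  # [(1, 101, 102)]
```

Events 102 and 103 only touch at 10:30, and the strict inequalities mean back-to-back lessons are not reported as a clash.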
Self joins are useful whenever you want to compare records of the same table against each other. Examples are: Find duplicate addresses, find customers where the delivery address is not the same as the invoice address, compare a total in a daily report (saved as record) with the total of the previous day etc.

Joins explained by Venn Diagram with more than one join

http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
and
https://web.archive.org/web/20120621231245/http://www.khankennels.com/blog/index.php/archives/2007/04/20/getting-joins/
have been very helpful in learning the basics of joins using Venn diagrams. But I am wondering how you would apply that same thinking to a query that has more than one join.
Let's say I have 3 tables:
Employees
EmployeeID
FullName
EmployeeTypeID
EmployeeTypes (full time, part time, etc.)
EmployeeTypeID
TypeName
InsuranceRecords
InsuranceRecordID
EmployeeID
HealthInsuranceNumber
Now, I want my final result set to include data from all three tables, in this format:
EmployeeID | FullName | TypeName | HealthInsuranceNumber
Using what I learned from those two sites, I can use the following joins to get all employees, regardless of whether their insurance information exists:
SELECT
Employees.EmployeeID, FullName, TypeName, HealthInsuranceNumber
FROM Employees
INNER JOIN EmployeeTypes ON Employees.EmployeeTypeID = EmployeeTypes.EmployeeTypeID
LEFT OUTER JOIN InsuranceRecords ON Employees.EmployeeID = InsuranceRecords.EmployeeID
My question is, using the same kind of Venn diagram pattern, how would the above query be represented visually? Is this picture accurate?
I think it is not quite possible to map your example onto these types of diagrams for the following reason:
These diagrams describe intersections and unions in set theory. For the circles to overlap as depicted, all three sets need to contain elements of the same type, which by definition is not possible when we are dealing with three different tables that each contain a different type of (row) object.
If all three tables were joined on the same key, you could treat the values of that key as the elements of the sets, but since this is not the case in your example, this kind of picture is not applicable.
If we do assume that in your example both joins use the same key, then only the green area would be the correct result: the first join restricts you to the intersection of Employees and Employee Types, the second join restricts you to all of Employees, and since both join conditions must hold you get the intersection of those two regions, which is the green area.
Hope this helps.
That is not an accurate set diagram (either Venn or Euler). There are no entities which are members of both Employees and Employee Types. Even if your table schema represented some kind of table-inheritance, all the entities would still be in a base table.
Jeff's example on the Coding Horror blog only works with like entities i.e. two tables containing the same entities - technically a violation of normalization - or a self-join.
Venn diagrams can accurately depict scenarios like:
-- The intersection lens
SELECT *
FROM table
WHERE condition1
AND condition2
-- One of the lunes
SELECT *
FROM table
WHERE condition1
AND NOT condition2
-- The union
SELECT *
FROM table
WHERE condition1
OR condition2
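The correspondence between these three queries and set operations can be checked directly. The sketch below uses Python's sqlite3 module with a made-up one-column table t and two arbitrary conditions (all of these are assumptions for illustration); it works because both conditions select rows of the same table, which is the case where a Venn diagram is meaningful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (n INTEGER);
    INSERT INTO t VALUES (1), (2), (3), (4), (5), (6);
""")

def rows_where(condition):
    """Return the set of n values in t that satisfy a WHERE condition."""
    return {r[0] for r in conn.execute(f"SELECT n FROM t WHERE {condition}")}

a = rows_where("n % 2 = 0")  # condition1: even numbers
b = rows_where("n > 3")      # condition2: greater than 3

lens  = rows_where("n % 2 = 0 AND n > 3")        # the intersection lens
lune  = rows_where("n % 2 = 0 AND NOT (n > 3)")  # one of the lunes
union = rows_where("n % 2 = 0 OR n > 3")         # the union

print(lens == a & b, lune == a - b, union == a | b)  # True True True
```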

Dynamic Tables?

I have a database that has different grades per course (i.e. three homeworks for Course 1, two homeworks for Course 2, ... ,Course N with M homeworks). How should I handle this as far as database design goes?
CourseID HW1 HW2 HW3
1 100 99 100
2 100 75 NULL
EDIT
I guess I need to rephrase my question. As of right now, I have two tables, Course and Homework. Homework points to Course through a foreign key. My question is how do I know how many homeworks will be available for each class?
No, this is not a good design. It's an antipattern that I called Metadata Tribbles. You have to keep adding new columns for each homework, and they propagate out of control.
It's an example of repeating groups, which violates the First Normal Form of relational database design.
Instead, you should create one table for Courses, and another table for Homeworks. Each row in Homeworks references a parent row in Courses.
My question is how do I know how many homeworks will be available for each class?
You'd add rows for each homework, then you can count them as follows:
SELECT CourseId, COUNT(*) AS Num_HW_Per_Course
FROM Homeworks
GROUP BY CourseId
Of course this only counts the homeworks after you have populated the table with rows. So you (or the course designers) need to do that.
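A minimal sketch of the two-table design, run through Python's sqlite3 module (the HomeworkId and Grade columns and the third course are assumptions made for the example). It also shows one variation: grouping from Courses with a LEFT JOIN, so that a course with no homework rows yet is reported with a count of 0 instead of being left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Courses (CourseId INTEGER PRIMARY KEY, CourseName TEXT);
    CREATE TABLE Homeworks (
        HomeworkId INTEGER PRIMARY KEY,
        CourseId   INTEGER NOT NULL REFERENCES Courses(CourseId),
        Grade      INTEGER
    );
    INSERT INTO Courses VALUES (1, 'Course 1'), (2, 'Course 2'), (3, 'Course 3');
    -- One row per homework: three for course 1, two for course 2, none for course 3.
    INSERT INTO Homeworks (CourseId, Grade) VALUES
        (1, 100), (1, 99), (1, 100),
        (2, 100), (2, 75);
""")

counts = conn.execute("""
    SELECT c.CourseId, COUNT(h.HomeworkId) AS Num_HW_Per_Course
    FROM Courses c
    LEFT JOIN Homeworks h ON h.CourseId = c.CourseId
    GROUP BY c.CourseId
    ORDER BY c.CourseId
""").fetchall()

print(counts)  # [(1, 3), (2, 2), (3, 0)]
```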
Decompose the table into three different tables. One holds the courses, the second holds the homeworks, and the third connects them and stores the result.
Course:
CourseID CourseName
1 Foo
Homework:
HomeworkID HomeworkName HomeworkDescription
HW1 Bar ...
Result:
CourseID HomeworkID Result
1 HW1 100

Modeling Existential Facts in a Relational Database

I need a way to represent existential relations in a database. For instance I have a bio-historical table (i.e. a family tree) that stores a parent id and a child id which are foreign keys to a people table. This table is used to describe arbitrary family relationships. Thus I’d like to be able to say that X and Y are siblings without having to know exactly who the parents of X and Y are. I just want to be able to say that there exists two different people A and B such that A and B are each parents of X and Y. Once I do know who A and/or B are I’d need to be able to reconcile them.
The simplest solution I can think of is to store existential people with negative integer user ids. Once I know who the people are, I’d need to cascade update all of the IDs. Are there any well-known techniques for this?
Does existential mean "nonexistent"?
They don't have to be negative. You could just add a record to People table with no last/first name and perhaps a flag "unknown person". Or existential if you like.
Then when you know something (e.g. like last name but not first) you update this record.
Reconciling duplicate people could be more difficult. I guess you could just update FamilyTree set parent_id=new_id where parent_id=old_id, etc. But this means for instance that the same person could end up with too many parents, so you'll need to perform a number of complex checks before doing that.
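One possible shape for the placeholder-and-reconcile idea, sketched with Python's sqlite3 module. The IsUnknown flag, the reconcile helper and the sample people are all hypothetical, UPDATE OR IGNORE is SQLite-specific, and the sketch only repoints parent links; it does none of the sanity checks (such as too many parents) mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People (
        PersonId  INTEGER PRIMARY KEY,
        Name      TEXT,
        IsUnknown INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE FamilyTree (
        parent_id INTEGER NOT NULL REFERENCES People(PersonId),
        child_id  INTEGER NOT NULL REFERENCES People(PersonId),
        PRIMARY KEY (parent_id, child_id)
    );
    INSERT INTO People VALUES
        (1, 'X', 0), (2, 'Y', 0),
        (3, NULL, 1),      -- placeholder: some parent shared by X and Y
        (4, 'Alice', 0);   -- real person later identified as that parent
    INSERT INTO FamilyTree VALUES (3, 1), (3, 2);
""")

def reconcile(conn, old_id, new_id):
    """Repoint parent links from placeholder old_id to new_id, then drop the placeholder."""
    with conn:  # run all three statements in one transaction
        # OR IGNORE skips rows that would duplicate an existing (parent, child) pair.
        conn.execute("UPDATE OR IGNORE FamilyTree SET parent_id = ? WHERE parent_id = ?",
                     (new_id, old_id))
        # Remove any skipped leftovers that still point at the placeholder.
        conn.execute("DELETE FROM FamilyTree WHERE parent_id = ?", (old_id,))
        conn.execute("DELETE FROM People WHERE PersonId = ? AND IsUnknown = 1", (old_id,))

reconcile(conn, old_id=3, new_id=4)
links = conn.execute(
    "SELECT parent_id, child_id FROM FamilyTree ORDER BY child_id").fetchall()
print(links)  # [(4, 1), (4, 2)]
```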
I would document only the known relationships in a link table which links your Person table to itself with:
FK Person1ID
FK Person2ID
RelationshipTypeID (Sibling, Father, Mother, Step-Father, Step-Mother, etc.)
With some appropriate constraints on that table (or multiple tables, one for each relationship type if that makes the constraints more logical)
Then, when other relationships can be inferred but are missing (which you can find by running an exception query), create them. Be careful with the inference, since it is only a possibility: a half-sibling, for example, shares only one parent.
For instance, people who are siblings who don't have all their parents identified:
SELECT *
FROM People p1
INNER JOIN Relationship r_sibling
ON r_sibling.Person1ID = p1.PersonID
AND r_sibling.RelationshipTypeID = SIBLING_TYPE_CONSTANT
INNER JOIN People p2
ON r_sibling.Person2ID = p2.PersonID
WHERE EXISTS (
-- p1 has a father
SELECT *
FROM Relationship r_father
WHERE r_father.RelationshipTypeID = FATHER_TYPE_CONSTANT
AND r_father.Person2ID = p1.PersonID
)
AND NOT EXISTS (
-- p2 (p1's sibling) doesn't have a father yet
SELECT *
FROM Relationship r_father
WHERE r_father.RelationshipTypeID = FATHER_TYPE_CONSTANT
AND r_father.Person2ID = p2.PersonID
)
You might need to UNION the reverse of this query depending on how you want your relationships constrained (siblings are always commutative, unlike other relationships) and then handle mothers similarly.
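To try the exception query on a tiny data set, here is a sketch using Python's sqlite3 module. The numeric type codes (1 for sibling, 2 for father), the sample people, and the convention that Person1ID is the father of Person2ID are assumptions of this sketch, chosen to match how the query above reads the table.

```python
import sqlite3

SIBLING, FATHER = 1, 2  # hypothetical RelationshipTypeID values

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Relationship (
        Person1ID          INTEGER NOT NULL REFERENCES People(PersonID),
        Person2ID          INTEGER NOT NULL REFERENCES People(PersonID),
        RelationshipTypeID INTEGER NOT NULL,
        PRIMARY KEY (Person1ID, Person2ID, RelationshipTypeID)
    );
    INSERT INTO People VALUES (1, 'X'), (2, 'Y'), (3, 'F');
    INSERT INTO Relationship VALUES
        (1, 2, 1),   -- X is a sibling of Y
        (3, 1, 2);   -- F is the father of X; Y has no father recorded
""")

# Sibling pairs where the first person has a father on record and the second does not.
missing = conn.execute("""
    SELECT p1.PersonID, p2.PersonID
    FROM People p1
    JOIN Relationship r_sibling
      ON r_sibling.Person1ID = p1.PersonID
     AND r_sibling.RelationshipTypeID = :sibling
    JOIN People p2 ON r_sibling.Person2ID = p2.PersonID
    WHERE EXISTS (
        SELECT 1 FROM Relationship r
        WHERE r.RelationshipTypeID = :father AND r.Person2ID = p1.PersonID)
      AND NOT EXISTS (
        SELECT 1 FROM Relationship r
        WHERE r.RelationshipTypeID = :father AND r.Person2ID = p2.PersonID)
""", {"sibling": SIBLING, "father": FATHER}).fetchall()

print(missing)  # [(1, 2)]
```

The pair (1, 2) is a candidate for inferring that F may also be Y's father; whether to create that row is a judgment call, because half-siblings would make the inference wrong.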
Hmmm, come to think of it, I guess I need a general way to reconcile duplicate people anyway and I can use it for this purpose. Thoughts?