Modeling Existential Facts in a Relational Database - sql

I need a way to represent existential relations in a database. For instance I have a bio-historical table (i.e. a family tree) that stores a parent id and a child id which are foreign keys to a people table. This table is used to describe arbitrary family relationships. Thus I’d like to be able to say that X and Y are siblings without having to know exactly who the parents of X and Y are. I just want to be able to say that there exists two different people A and B such that A and B are each parents of X and Y. Once I do know who A and/or B are I’d need to be able to reconcile them.
The simplest solution I can think of is to store existential people with negative integer user ids. Once I know who the people are, I’d need to cascade update all of the IDs. Are there any well-known techniques for this?

Does existential mean "non existant"?
They don't have to be negative. You could just add a record to People table with no last/first name and perhaps a flag "unknown person". Or existential if you like.
Then when you know something (e.g. like last name but not first) you update this record.
Reconciling duplicate people could be more difficult. I guess you could just update FamilyTree set parent_id=new_id where parent_id=old_id, etc. But this means for instance that the same person could end up with too many parents, so you'll need to perform a number of complex checks before doing that.

I would document only the known relationships in a link table which links your Person table to itself with:
FK Person1ID
FK Person2ID
RelationshipTypeID (Sibling, Father, Mother, Step-Father, Step-Mother, etc.)
With some appropriate constraints on that table (or multiple tables, one for each relationship type if that makes the constraints more logical)
Then when other relationships can possibly (a half-sibling will only share one parent) be inferred (by running an exception query) but are missing, create them.
For instance, people who are siblings who don't have all their parents identified:
SELECT *
FROM People p1
INNER JOIN Relationship r_sibling
ON r_sibling.Person1ID = p1.PersonID
AND r_sibling.RelationshipType = SIBLING_TYPE_CONSTANT
INNER JOIN People p2
ON r_sibling.Person2ID = p2.PersonID
WHERE EXISTS (
-- p1 has a father
SELECT *
FROM Relationship r_father
ON r_father.RelationshipType = FATHER_TYPE_CONSTANT
AND r_father.Person2ID = p1.PersonID
)
AND NOT EXISTS (
-- p2 (p1's sibling) doesn't have a father yet
SELECT *
FROM Relationship r_father
ON r_father.RelationshipType = FATHER_TYPE_CONSTANT
AND r_father.Person2ID = p2.PersonID
)
You might need to UNION the reverse of this query depending on how you want your relationships constrained (siblings are always commutative, unlike other relationships) and then handle mothers similarly.

Hmmm, come to think of it, I guess I need a general way to reconcile duplicate people anyway and I can use it for this purpose. Thoughts?

Related

What is a self join for? (in english)

I already know what a self-join does. Thank you, I also read all the other computerised operational descriptions on stack overflow, so I know this is not actually a duplicate question, so please do not give me tables or join lists.
What I am seeking to why it would be done (and please, not just the self-referencing employee-manager example).
In plain English, what would I seek to achieve from a self join?
My usage is in a university course, and coming from a Relational Algebra angle. I have done some SQL for a few years but the instructor loves to do self-joins on tables (after renaming one or more fields). Not something often done in SQL, so I'm wondering what the action is that he is trying to perform but he seems pretty keen about doing it frequently.
I thought I'd ask here, as many others have asked for this information but get marked as "already answered" but all the answers give operational descriptions NOT the "why is this being done".
The reason why the employee-manager example is so common, is because it hits the nail on the head. A self join on a table looks for pairs of rows, like any join, but with both rows coming from the same table. Nothing special really.
The database designer gives each base table a predicate (sentence template parameterized by column names).
Parent(person, child) -- person PERSON is parent of person CHILD
Likes(person, food) -- person PERSON likes food FOOD
Relational algebra is designed so that the value of a relational expression (base table name or operator call) holds the rows that make a true proposition (statement) from its predicate.
/* (PERSON, CHILD) rows where
person PERSON is parent of person CHILD
*/
Parent
The predicate of an expression that is a call to operator NATURAL JOIN is the AND of the predicates of its inputs.
/* (PERSON, CHILD, FOOD) rows where
person PERSON is parent of person CHILD AND person PERSON likes food FOOD
*/
Parent NATURAL JOIN Likes
Ditto for UNION & OR, MINUS & AND NOT, PROJECT column(s) & EXISTS other column(s), RESTRICT condition & AND condition and RENAME of a column & rename of a parameter.
/* (CHILD, FOOD) rows where
there EXISTS a value for PERSON such that
person PERSON is parent of person CHILD AND person CHILD likes food FOOD
*/
PROJECT child, food (Parent NATURAL JOIN (RENAME person:=child Likes))
So every query expression's value holds the rows that make its predicate into a true statement.
Suppose we define algebraic self-join of a table as NATURAL JOIN of two tables got from an original via sequences of zero or more renamings. Per above we NATURAL JOIN for rows that satisfy the AND of predicates. A self-join arises when we want the rows that satisfy a result predicate expressed via predicates that differ only in parameters/columns.
/* (PERSON, FOOD, CHILD) rows where
person PERSON likes food FOOD AND person CHILD likes food FOOD
*/
Likes NATURAL JOIN (RENAME person:=child Likes)
There's nothing special about a self-join arising in a given query in a given application other than that.
SQL SELECT DISTINCT statements can be described via algebraic operators. They also calculate query predicates. First FROM table columns are RENAMEd by prefixing a table alias (correlation name) & a dot. (SQL NATURAL JOIN doesn't dot common columns.) The new tables are NATURAL JOINed. ON and WHERE RESTRICT per a condition. Then the SELECT DISTINCT clause RENAMES to remove dots from returned columns & PROJECTS away unwanted dotted columns.
We can convert SQL to predicates directly: Dotting input columns renames. NATURAL/CROSS/INNER JOIN, ON & WHERE give AND. Each dot-free result column gives an AND that it equals its dotted version. Finally dropping all dotted columns gives EXISTS.
/* same as above */
/* (PERSON, FOOD, CHILD) rows where
there EXISTS values for P.* & C.* such that
PERSON = P.PERSON AND CHILD = C.person AND FOOD = P.FOOD
AND person P.CHILD likes food P.FOOD
AND person C.CHILD likes food C.FOOD
AND P.FOOD = C.FOOD
*/
SELECT DISTINCT p.person AS person, c.person AS child, p.food AS food
FROM Likes p INNER JOIN Likes c
ON p.food = c.food
Again: In SQL we say there is a self-join when multiple table aliases of a JOIN are associated with the same table value; in application terms that means we can express a query meaning in terms of predicates differing in some parameters/columns; there's nothing special about applications or table meanings for this to arise.
See this re query semantics which happens to include a link to this re self-join semantics in particular.

SQL query for recipes that can be made from collection of ingredients

I have a database that has a table of Ingredients I and a table of Recipes R. The two tables have a many-to-many relationship, as one recipes uses many ingredients and one ingredient is used in many recipes. I have a third cross-reference table that uses the cross-reference validation pattern to enforce my many-to-many relationship, and is done using string foreign keys (instead of integers).
Assuming I have a collection of ingredients C outside of my database, how can I query Recipe table R for every recipe that can be made using ONLY the list of ingredients supplied in C?
Other things to consider
1) Speed will (of course) be a concern eventually, but correctness is what I'm stuck on at the moment.
2) The collection of ingredients C might be very large (~100 ingredients).
Any answers or even just pointers in the right direction would be greatly appreciated.
Thanks,
Alec
One way is to write:
select ...
from R
where ID not in
( select R_ID
from RI
where I_ID not in
( select I_ID
from C
)
)
;
That is: start with C. Select all recipe–ingredient cross-references where the ingredient is not in C. This gives you the set of all recipes that cannot be made using only ingredients in C. Then, select all recipes that aren't in that set.

How to make this relation in PostgreSQL?

Hello.
As shown in the ER model, I want to create a relation between "Busses" and "Chauffeurs", where every entity in "Chauffeurs" must have at least one relation in "Certified", and every entity in "Busses" must have at least one relation in "Certified".
Though it was pretty easy to design the ER model, I can't seem to find a way of making this relation in PostgreSQL. Anybody got some ideas ?
Thanks
The solution should be database agnostic. If I understand you correctly, you probably want your certified table to look like:
CERTIFIED
id
bus_id
chauffer_id
...
...
The only solution I've been able to find is the notion of a single mandatory field in the parent table to represent the "at least one" and then storing the 2 or more relationships in the intersection table.
chauffeurs
chauffeur_id
chauffer_name
certified_bus_id (not null)
certified
chauffer_id
bus_id
busses
bus_id
bus_name
certified_chauffer_id (not null)
To get a list of all busses where a chauffer is certified becomes
select c.chauffer_name, b.bus_name
from chauffeurs c
inner join busses b on (b.bus_id = c.certified_bus_id)
UNION
select c.chauffer_name, b.bus_name
from chauffeurs c
inner join certified ct on (c.chauffeur_id = ct.chauffer_id)
inner join busses b on (ct.bus_id = b.bus_id)
The UNION (vs UNION ALL) takes care of deduplication with the values in certified.

Storing & Querying Heirarchical Data with Multiple Parent Nodes

I've been doing quite a bit of searching, but haven't been able to find many resources on the topic. My goal is to store scheduling data like you would find in a Gantt chart. So one example of storing the data might be:
Task Id | Name | Duration
1 Task A 1
2 Task B 3
3 Task C 2
Task Id | Predecessors
1 Null
2 Null
3 1
3 2
Which would have Task C waiting for both Task A and Task B to complete.
So my question is: What is the best way to store this kind of data and efficiently query it? Any good resources for this kind of thing? There is a ton of information about tree structures, but once you add in multiple parents it becomes hard to find info. By the way, I'm working with SQL Server and .NET for this task.
Your problem is related to the concept of relationship cardinality. All relationships have some cardinality, which expresses the potential number of instances on each side of the relationship that are members of it, or can participate in a single instance of the relationship. As an example, for people, (for most living things, I guess, with rare exceptions), the Parent-Child relationship has a cardinality of 2 to zero or many, meaning it takes two parents on the parent side, and there can be zero or many children (perhaps it should be 2 to 1 or many)
In database design, generally, anything that has a 1(one), (or a zero or one), on one side can be easily represented with just two tables, one for each entity, (sometimes only one table is needed see note**) and a foreign key column in the table representing the "many" side, that points to the other table holding the entity on the "one" side.
In your case you have a many to many relationship. (A Task can have multiple predecessors, and each predecessors can certainly be the predecessor for multiple tasks) In this case a third table is needed, where each row, effectively, represents an association between 2 tasks, representing that one is the predecessor to the other. Generally, This table is designed to contain only all the columns of the primary keys of the two parent tables, and it's own primary key is a composite of all the columns in both parent Primary keys. In your case it simply has two columns, the taskId, and the PredecessorTaskId, and this pair of Ids should be unique within the table so together they form the composite PK.
When querying, to avoid double counting data columns in the parent tables when there are multiple joins, simply base the query on the parent table... e.g., to find the duration of the longest parent,
Assuming your association table is named TaskPredecessor
Select TaskId, Max(P.Duration)
From Task T Join Task P
On P.TaskId In (Select PredecessorId
From TaskPredecessor
Where TaskId = T.TaskId)
** NOTE. In cases where both entities in the relationship are of the same entity type, they can both be in the same table. The canonical (luv that word) example is an employee table with the many to one relationship of Worker to Supervisor... Since the supervisor is also an employee, both workers and supervisors can be in the same [Employee] table, and the realtionship can gbe modeled with a Foreign Key (called say SupervisorId) that points to another row in the same table and contains the Id of the employee record for that employee's supervisor.
Use adjacency list model:
chain
task_id predecessor
3 1
3 2
and this query to find all predecessors of the given task:
WITH q AS
(
SELECT predecessor
FROM chain
WHERE task_id = 3
UNION ALL
SELECT c.predecessor
FROM q
JOIN chain c
ON c.task_id = q.predecessor
)
SELECT *
FROM q
To get the duration of the longest parent for each task:
WITH q AS
(
SELECT task_id, duration
FROM tasks
UNION ALL
SELECT t.task_id, t.duration
FROM q
JOIN chain с
ON c.task_id = q.task_id
JOIN tasks t
ON t.task_id = c.predecessor
)
SELECT task_id, MAX(duration)
FROM q
Check "Hierarchical Weighted total" pattern in "SQL design patterns" book, or "Bill Of Materials" section in "Trees and Hierarchies in SQL".
In a word, graphs feature double aggregation. You do one kind of aggregation along the nodes in each path, and another one across alternative paths. For example, find a minimal distance between the two nodes is minimum over summation. Hierarchical weighted total query (aka Bill Of Materials) is multiplication of the quantities along each path, and summation along each alternative path:
with TCAssembly as (
select Part, SubPart, Quantity AS factoredQuantity
from AssemblyEdges
where Part = ‘Bicycle’
union all
select te.Part, e.SubPart, e.Quantity * te.factoredQuantity
from TCAssembly te, AssemblyEdges e
where te.SubPart = e.Part
) select SubPart, sum(Quantity) from TCAssembly
group by SubPart

Weird many to many and one to many relationship

I know I'm gonna get down votes, but I have to make sure if this is logical or not.
I have three tables A, B, C. B is a table used to make a many-many relationship between A and C. But the thing is that A and C are also related directly in a 1-many relationship
A customer added the following requirement:
Obtain the information from the Table B inner joining with A and C, and in the same query relate A and C in a one-many relationship
Something like:
alt text http://img247.imageshack.us/img247/7371/74492374sa4.png
I tried doing the query but always got 0 rows back. The customer insists that I can accomplish the requirement, but I doubt it. Any comments?
PS. I didn't have a more descriptive title, any ideas?
UPDATE:
Thanks to rcar, In some cases this can be logical, in order to have a history of all the classes a student has taken (supposing the student can only take one class at a time)
UPDATE:
There is a table for Contacts, a table with the Information of each Contact, and the Relationship table. To get the information of a Contact I have to make a 1:1 relationship with Information, and each contact can have like and an address book with; this is why the many-many relationship is implemented.
The full idea is to obtain the contact's name and his address book.
Now that I got the customer's idea... I'm having trouble with the query, basically I am trying to use the query that jdecuyper wrote, but as he warns, I get no data back
This is a doable scenario. You can join a table twice in a query, usually assigning it a different alias to keep things straight.
For example:
SELECT s.name AS "student name", c1.className AS "student class", c2.className as "class list"
FROM s
JOIN many_to_many mtm ON s.id_student = mtm.id_student
JOIN c c1 ON s.id_class = c1.id_class
JOIN c c2 ON mtm.id_class = c2.id_class
This will give you a list of all students' names and "hardcoded" classes with all their classes from the many_to_many table.
That said, this schema doesn't make logical sense. From what I can gather, you want students to be able to have multiple classes, so the many_to_many table should be where you'd want to find the classes associated with a student. If the id_class entries used in table s are distinct from those in many_to_many (e.g., if s.id_class refers to, say, homeroom class assignments that only appear in that table while many_to_many.id_class refers to classes for credit and excludes homeroom classes), you're going to be better off splitting c into two tables instead.
If that's not the case, I have a hard time understanding why you'd want one class hardwired to the s table.
EDIT: Just saw your comment that this was a made-up schema to give an example. In other cases, this could be a sensible way to do things. For example, if you wanted to keep track of company locations, you might have a Company table, a Locations table, and a Countries table. The Company table might have a 1-many link to Countries where you would keep track of a company's headquarters country, but a many-to-many link through Locations where you keep track of every place the company has a store.
If you can give real information as to what the schema really represents for your client, it might be easier for us to figure out whether it's logical in this case or not.
Perhaps it's a lack of caffeine, but I can't conceive of a legitimate reason for wanting to do this. In the example you gave, you've got students, classes and a table which relates the two. If you think about what you want the query to do, in plain English, surely it has to be driven by either the student table or the class table. i.e.
select all the classes which are attended by student 1245235
select all the students which attend class 101
Can you explain the requirement better? If not, tell your customer to suck it up. Having a relationship between Students and Classes directly (A and C), seems like pure madness, you've already got table B which does that...
Bear in mind that the one-to-many relationship can be represented through the many-to-many, most simply by adding a field there to indicate the type of relationship. Then you could have one "current" record and any number of "history" ones.
Was the customer "requirement" phrased as given, by the way? I think I'd be looking to redefine my relationship with them if so: they should be telling me "what" they want (ideally what, in business domain language, their problem is) and leaving the "how" to me. If they know exactly how the thing should be implemented, then I'd be inclined to open the source code in an editor and leave them to it!
I'm supposing that s.id_class indicates the student's current class, as opposed to classes she has taken in the past.
The solution shown by rcar works, but it repeats the c1.className on every row.
Here's an alternative that doesn't repeat information and it uses one fewer join. You can use an expression to compare s.id_class to the current c.id_class matched via the mtm table.
SELECT s.name, c.className, (s.id_class = c.id_class) AS is_current
FROM s JOIN many_to_many AS mtm ON (s.id_student = mtm.id_student)
JOIN c ON (c.id_class = mtm.id_class);
So is_current will be 1 (true) on one row, and 0 (false) on all the other rows. Or you can output something more informative using a CASE construct:
SELECT s.name, c.className,
CASE WHEN s.id_class = c.id_class THEN 'current' ELSE 'past' END AS is_current
FROM s JOIN many_to_many AS mtm ON (s.id_student = mtm.id_student)
JOIN c ON (c.id_class = mtm.id_class);
It doesn't seem to make sense. A query like:
SELECT * FROM relAC RAC
INNER JOIN tableA A ON A.id_class = RAC.id_class
INNER JOIN tableC C ON C.id_class = RAC.id_class
WHERE A.id_class = B.id_class
could generate a set of data but inconsistent. Or maybe we are missing some important part of the information about the content and the relationships of those 3 tables.
I personally never heard a requirement from a customer that would sound like:
Obtain the information from the Table
B inner joining with A and C, and in
the same query relate A and C in a
one-many relationship
It looks like that it is what you translated the requirement to.
Could you specify the requirement in plain English, as what results your customer wants to get?