What data structure should I use to track dependency?

I have a bunch of tables in a relational database which, obviously, depend on one another through foreign key relationships. I want to build a dependency tree, traverse it, and output INSERT SQL statements. I need to output the SQL for the foreign key tables in my dependency tree first, because parent tables depend on values from their foreign key identifier tables.
Does a binary tree, traversed in postorder, seem suitable for this task?

Take a look at the following:
Microsoft.SqlServer.Management.Smo.Server
Microsoft.SqlServer.Management.Smo.Database
Microsoft.SqlServer.Management.Smo.Scripter
Microsoft.SqlServer.Management.Smo.DependencyTree
Microsoft.SqlServer.Management.Smo.DependencyWalker
Microsoft.SqlServer.Management.Smo.DependencyCollection
Microsoft.SqlServer.Management.Smo.DependencyCollectionNode
There are examples on MSDN showing how to use all of this.
Essentially you want something like
Server server = new Server(SOURCESERVER);
Database database = server.Databases[SOURCEDATABASE];
Scripter sp = new Scripter(server);
...
UrnCollection col = new UrnCollection();
foreach (Table table in database.Tables)
{
    col.Add(table.Urn);
}
....
DependencyTree tree = sp.DiscoverDependencies(col, DependencyType.Parents);
DependencyWalker walker = new DependencyWalker(server);
DependencyCollection depends = walker.WalkDependencies(tree);
//Iterate over each table in DB in dependent order...
foreach (DependencyCollectionNode dcn in depends)
...

If a table can be dependent on more than two tables, a binary tree will be insufficient.
Let table A be dependent on tables B, C and D. Then you would have to insert into B, C and D first, i.e. A should have three child nodes in your tree.
I think you need to use a more general tree structure which allows an arbitrary number of child nodes. Traversing this tree structure in post-order should yield the desired results, as you suggested.
Things will start to get messy when your dependency graph contains cycles and you need to defer constraint checking ;)
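To make the post-order idea concrete, here is a minimal Python sketch. It is not tied to any database API: the `schema` dictionary and the `insert_order` function are hypothetical names, and the dependency map is assumed to have been read from the foreign key metadata already. Because one table can be referenced by several others, the structure is really a graph rather than a tree, so the sketch remembers which tables it has already emitted and raises on a cycle instead of looping forever.

```python
from typing import Dict, List


def insert_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Return tables so that every table comes after the tables it references."""
    order: List[str] = []
    state: Dict[str, str] = {}  # table -> "visiting" or "done"

    def visit(table: str) -> None:
        if state.get(table) == "done":
            return
        if state.get(table) == "visiting":
            raise ValueError(f"cyclic foreign keys involving {table!r}")
        state[table] = "visiting"
        for referenced in depends_on.get(table, []):
            visit(referenced)
        state[table] = "done"
        order.append(table)  # post-order: emitted after everything it needs

    for table in depends_on:
        visit(table)
    return order


# Hypothetical schema: A references B, C and D; D references B.
schema = {"A": ["B", "C", "D"], "B": [], "C": [], "D": ["B"]}
order = insert_order(schema)
print(order)
```

The INSERT statements would then be generated table by table in the returned order. A cycle has to be handled separately, for example by deferring constraint checks as mentioned above.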

Related

Is there a way in JOOQ to pull a number of records without multiple DB calls?

In our webapp we have a number of places where you would be updating a number of tables in one complex form/view. In raw SQL I would probably select a bunch of columns from a bunch of tables and edit that one record on the primary table as well as related parent/child tables.
In Hibernate I would probably just pull a JPA entity for the main table and let Hibernate fetch the parent/child relationships as I populate the view, and then later copy the values from my view back to the entity and call EntityManager.persist/merge.
In JOOQ I have a number of options but it appears you can pull a main record via selectFrom/fetch then use fetchChild fetchParent to pull typed related records like so...
LoadsRecord load = dslContext.selectFrom(LOADS)
    .where(LOADS.ID.eq(id))
    .fetchOne();
SafetyInspectionsRecord safetyInspection = load.fetchParent(Keys.LOADS__FK_SAFETY_INSPECTION);
So this way I am able to pull related records in a typesafe manner. The only annoying thing is that I have to run another full query every time I call fetchParent or fetchChild. Is there a way to eagerly fetch these all at once to avoid multiple round trips to the DB?
It is really nice to have these classes like LoadsRecord for CRUD screens, it makes updating the DB easy.

Classic approach using joins
There are various ways to materialise a to-one relationship. The simplest is a plain JOIN, or a LEFT JOIN if the relationship is optional.
E.g.:
Result<?> result =
ctx.select()
   .from(LOADS)
   .join(SAFETY_INSPECTIONS)
   .on(LOADS.SAFETY_INSPECTIONS_ID.eq(SAFETY_INSPECTIONS.ID))
   .fetch();
You probably want to work with the generated records thereafter, so you can use various mapping tools to map the generic Record types to the two UpdatableRecord types for further CRUD:
for (Record r : result) {
    LoadsRecord loads = r.into(LOADS);
    SafetyInspectionsRecord si = r.into(SAFETY_INSPECTIONS);
}
Using nested records
Starting from jOOQ 3.15 and #11812, MULTISET and ROW operators can be used to create nested collections and records. So, in your query, you could write:
Result<?> result =
ctx.select(
        row(LOADS.ID, ...),
        row(SAFETY_INSPECTIONS.ID, ...))
   .from(LOADS)
   .join(SAFETY_INSPECTIONS)
   .on(LOADS.SAFETY_INSPECTIONS_ID.eq(SAFETY_INSPECTIONS.ID))
   .fetch();
That would already help map the nested data structures into the desired format. Starting from jOOQ 3.17 and #4727, you can even use table expressions directly to generate nested records:
Result<Record2<LoadsRecord, SafetyInspectionsRecord>> result =
ctx.select(LOADS, SAFETY_INSPECTIONS)
   .from(LOADS)
   .join(SAFETY_INSPECTIONS)
   .on(LOADS.SAFETY_INSPECTIONS_ID.eq(SAFETY_INSPECTIONS.ID))
   .fetch();
This new feature is definitely going to close one of jOOQ's biggest gaps. You could even simplify the above using implicit joins to this:
Result<Record2<LoadsRecord, SafetyInspectionsRecord>> result =
ctx.select(LOADS, LOADS.safetyInspections())
   .from(LOADS)
   .fetch();

Table design for hierarchical data

I am trying to design a table that contains sections, where each section contains tasks, each task contains subtasks, and so on. I would like to do it in one table. Please let me know the best single-table approach that is scalable. I am pretty new to database design. If a single table is not the best approach, please suggest what would be. I am using DB2.

Put quite simply, I would say use 1 table for tasks.
In addition to all its various other attributes, each task should have a primary identifier, and another column to optionally contain the identifier of its parent task.
If you are using DB2 for z/OS, then you will use a recursive query with a common table expression. Otherwise you can use a hierarchical recursive query in DB2 for i, or possibly in DB2 for LUW (Linux, Unix, Windows).
Other designs requiring more tables, each specializing in a certain part of the task:subtask relationship, may needlessly introduce issues or limitations.
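As a rough illustration of the single-table design with a recursive common table expression, here is a sketch that uses Python's built-in sqlite3 module rather than DB2, simply so that it can be run anywhere. The table name `task`, its columns and the sample rows are all made up for the example, and DB2's recursive CTE syntax differs in places (for instance, it does not use the RECURSIVE keyword), so treat this as the shape of the query rather than something to paste into DB2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES task(id),  -- NULL for a top-level section
        name      TEXT NOT NULL
    );
    INSERT INTO task VALUES
        (1, NULL, 'Section A'),
        (2, 1,    'Task A.1'),
        (3, 2,    'Subtask A.1.1'),
        (4, 1,    'Task A.2'),
        (5, NULL, 'Section B');
""")


def subtree(root_id):
    """Return (id, name, depth) for root_id and all of its descendants."""
    return conn.execute("""
        WITH RECURSIVE sub(id, name, depth) AS (
            SELECT id, name, 0 FROM task WHERE id = ?
            UNION ALL
            SELECT t.id, t.name, sub.depth + 1
            FROM task AS t JOIN sub ON t.parent_id = sub.id
        )
        SELECT id, name, depth FROM sub ORDER BY id
    """, (root_id,)).fetchall()


rows = subtree(1)
for row in rows:
    print(row)
```

The anchor part of the CTE selects the starting row, and the recursive part keeps joining children onto what has been found so far, so the depth of the hierarchy does not need to be known in advance.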

There are a few ways to do this.
One idea is to use two tables: Sections and Tasks
There could be a one-to-many relationship between the two. The Task table could be designed as a tree with a TaskId and a ParentTaskId, which means you can have Tasks that go n levels deep (sub tasks of sub tasks of sub tasks etc). Every Task except for the root task will have a parent.
I guess you can also solve this by using a single table where you just add a section column to the Task table I described above.

Putting everything into one table, although convenient, will be inefficient in the long run. You would be storing unnecessarily repeated groups of data in your database, which is not processor or memory friendly at all. It would in fact violate the normalization rules, specifically the 1st Normal Form, which says that there should be no repeating groups in your table. It would also violate the 3rd Normal Form, which says that there should be no transitive dependency of a non-primary key on another non-primary key.
To give you an illustration, I will put your design into one table. I am guessing at the possible fields, so bear with me; this is for the sake of discussion. Look at the graphics below:
If you look at the graphics above (it is rather small, but you can download the image and look at it more closely), the SectionName, Taskname, TaskInitiator, TaskStartDate and TaskEndDate are repeated unnecessarily, which, as I mentioned earlier, is a violation of the 1st Normal Form.
Secondly, Taskname, TaskInitiator, TaskStartDate and TaskEndDate are functionally dependent on TaskID, which is not a primary key, instead of on SectionID, which in this case should be the primary key (if on a separate table). This is a violation of the 3rd Normal Form, which says that there should be no transitive dependence, that is, no non-primary key should be dependent on another non-primary key.
Although there are instances where you have to denormalize, I believe this one should be normalized. In my own estimation there should be three tables involved in your design, namely Sections, Tasks and SubTasks, which would look like the ones below.
Section is related to Tasks, that is, a section could have many Tasks.
And Task is related to Sub-Tasks, that is, a Task could have many Sub-tasks.

If I understand correctly, the original poster does not know how many levels of hierarchy will be needed (hence "and so on"). The problem is to create a design that can hold a structure of any depth.
IMHO that is a complex issue that does not have a single answer. When implementing such a design you need to weigh factors such as:
Will the structure be fairly constant? (How many writes?)
How often will this structure be read?
What operations will need to be possible? (Get all children objects of a given object? Get the parent object? Get the direct children?)
If the structure will be fairly constant, you could use the nested set model (http://en.wikipedia.org/wiki/Nested_set_model)
In this model the table has a 'left' and a 'right' column. A parent object's left and right values encompass the left and right values of all of its child objects.
In that way you can list all the children of an object using a query like this:
SELECT child.id
FROM table AS parent
JOIN table AS child
  ON child.left BETWEEN parent.left AND parent.right
 AND child.right BETWEEN parent.left AND parent.right
WHERE parent.id = @searchId
This design can be VERY fast to read, but is also EXTREMELY costly when the structure changes (for example, when adding a child to any object you will have to update every object with a 'right' value that is higher than the inserted one).
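Here is a small runnable sketch of both sides of that trade-off, using Python's sqlite3 module. The table name `node` and the sample tree are invented for the example, and the columns are called `lft` and `rgt` because LEFT and RIGHT are reserved words in many SQL dialects. Unlike the BETWEEN query above, the comparison here is strict, so the parent itself is not returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT, lft INTEGER, rgt INTEGER);
    -- Tree: root(1..8) contains a(2..5) and b(6..7); a contains a1(3..4)
    INSERT INTO node VALUES
        (1, 'root', 1, 8),
        (2, 'a',    2, 5),
        (3, 'a1',   3, 4),
        (4, 'b',    6, 7);
""")


def descendants(node_id):
    """Read side: one join, no recursion, whatever the depth."""
    return [r[0] for r in conn.execute("""
        SELECT child.name
        FROM node AS parent
        JOIN node AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.id = ?
        ORDER BY child.lft
    """, (node_id,))]


def add_leaf(parent_id, new_id, name):
    """Write side: insert a last child, shifting every later bound by 2."""
    (parent_rgt,) = conn.execute(
        "SELECT rgt FROM node WHERE id = ?", (parent_id,)).fetchone()
    conn.execute("UPDATE node SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,))
    conn.execute("UPDATE node SET lft = lft + 2 WHERE lft > ?", (parent_rgt,))
    conn.execute("INSERT INTO node VALUES (?, ?, ?, ?)",
                 (new_id, name, parent_rgt, parent_rgt + 1))


before = descendants(1)
add_leaf(2, 5, "a2")
after = descendants(1)
print(before, after)
```

Adding the single leaf `a2` rewrites the bounds of `a`, `b` and `root`, which is the write cost described above. On a large table those two UPDATE statements can touch a big share of the rows.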
If you need to be able to make changes to structure in real time you should probably use a design with two tables - one holding the objects, the second the structure (something like parentId, childId, differenceInHierarchyLevels).
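A minimal sketch of that two-table design, again with Python's sqlite3 module. The names `item`, `item_path` and the helper functions are assumptions made for the example, and the sketch only covers inserting new leaves and reading descendants. Moving or deleting a subtree needs extra statements that are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_path (
        ancestor_id   INTEGER NOT NULL REFERENCES item(id),
        descendant_id INTEGER NOT NULL REFERENCES item(id),
        depth         INTEGER NOT NULL,  -- difference in hierarchy levels
        PRIMARY KEY (ancestor_id, descendant_id)
    );
""")


def add_item(item_id, name, parent_id=None):
    conn.execute("INSERT INTO item VALUES (?, ?)", (item_id, name))
    # Every item is its own ancestor at depth 0.
    conn.execute("INSERT INTO item_path VALUES (?, ?, 0)", (item_id, item_id))
    if parent_id is not None:
        # Copy each path that ends at the parent, extended by one level.
        conn.execute("""
            INSERT INTO item_path (ancestor_id, descendant_id, depth)
            SELECT ancestor_id, ?, depth + 1
            FROM item_path
            WHERE descendant_id = ?
        """, (item_id, parent_id))


def descendants(item_id, max_depth=None):
    sql = "SELECT descendant_id FROM item_path WHERE ancestor_id = ? AND depth > 0"
    params = [item_id]
    if max_depth is not None:
        sql += " AND depth <= ?"
        params.append(max_depth)
    return [r[0] for r in conn.execute(sql + " ORDER BY descendant_id", params)]


add_item(1, "section")
add_item(2, "task", parent_id=1)
add_item(3, "subtask", parent_id=2)
add_item(4, "sub-subtask", parent_id=3)
print(descendants(1), descendants(1, max_depth=1))
```

An insert only adds rows for the new item, so nothing existing has to be renumbered, and "all children" and "direct children only" are both plain single-table lookups.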

How to design SQL tables to allow for multiple parent table options?

Looking into creating a series of tables in a reporting hierarchy and am more or less drawing a blank.
While tables in this structure can only have one parent, how do I structure the fields so that they point to the right parent table? As you'll see in my example below, a row's parent table can differ.
              ARRANGEMENT
              /         \
        MATTERS          ISSUES
           |                |
         PHASES          MATTERS
         /    \             |
    ISSUES     TASKS      PHASES
    /    \       |        /    \
TITLES  TASKS  ISSUES  TASKS  TITLE
          |      |       |
       TITLES TITLES    TITLE
Essentially, is it best to have each "branch" have a unique table (even though Tasks in branch 1 has the same data structure as branch 2 or 3), or is it best to have the records identify which table is their parent?
Arrangement(ID)
Matters(ParentTable, ParentID, ID)
Phases(ParentTable, ParentID, ID)
Issues(ParentTable,ParentID, ID)
Titles(ParentTable,ParentID, ID)
Tasks(ParentTable,ParentID, ID)
The above doesn't seem right to me at all. Help?

I have two opinions. You must be clear on the meaning of these "polymorphic" entities. Are they semantically different and separate? Even if they look the same, you may not want to put them into the same table if they serve different purposes.
If a Matter is truly, semantically fungible between Arrangements and Issues, then I'd suggest using a form of "Mapping" tables:
Arrangement(ID)
Matters(ID)
Issues(ID)
ArrangementMatters (FK_ArrangementID, FK_MatterID)
IssueMatters (FK_IssueID, FK_MatterID)
You can continue this pattern throughout the "polymorphic" tables.
You can add a unique constraint on FK_MatterID columns if required.
It's easy to write queries:
Select * from Arrangement a
inner join ArrangementMatters am on am.FK_arrangementID = a.ID
inner join Matters m on m.ID = am.FK_matterID
to get all your Matters associated with Arrangements.
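For what it's worth, here is a runnable sketch of the mapping-table variant with the optional unique constraint, using Python's sqlite3 module. Only the Arrangement side is shown, the sample IDs are made up, and `matters_of` and `try_link` are hypothetical helper names. Note that a unique constraint on one mapping table does not stop the same Matter from also being linked in IssueMatters; that rule would need to be enforced some other way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE Arrangement (ID INTEGER PRIMARY KEY);
    CREATE TABLE Matters     (ID INTEGER PRIMARY KEY);
    CREATE TABLE ArrangementMatters (
        FK_ArrangementID INTEGER NOT NULL REFERENCES Arrangement(ID),
        FK_MatterID      INTEGER NOT NULL UNIQUE REFERENCES Matters(ID)
    );
    INSERT INTO Arrangement VALUES (1), (2);
    INSERT INTO Matters VALUES (10), (11);
    INSERT INTO ArrangementMatters VALUES (1, 10), (1, 11);
""")


def matters_of(arrangement_id):
    return [r[0] for r in conn.execute("""
        SELECT m.ID
        FROM Arrangement a
        INNER JOIN ArrangementMatters am ON am.FK_ArrangementID = a.ID
        INNER JOIN Matters m ON m.ID = am.FK_MatterID
        WHERE a.ID = ?
        ORDER BY m.ID
    """, (arrangement_id,))]


def try_link(arrangement_id, matter_id):
    """Return True if the link was stored, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO ArrangementMatters VALUES (?, ?)",
                     (arrangement_id, matter_id))
        return True
    except sqlite3.IntegrityError:
        return False


print(matters_of(1))
print(try_link(2, 10))  # Matter 10 already belongs to Arrangement 1
print(try_link(2, 99))  # Matter 99 does not exist
```

Both rejected inserts are caught by the database itself, which is the benefit of mapping tables with real foreign keys over a free-form ParentTable/ParentID pair.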
On the other hand, if a Matter under an Arrangement is NOT semantically fungible and only has the same schema, then I'd suggest creating completely separate tables:
Arrangement(ID)
ArrangementMatters(ID, FK_ArrangementID)
Issues(ID)
IssueMatters(ID, FK_IssueID)
This conveys the distinction to the world. It gives you a lot of benefits if you can separate your concerns. (i.e. Maybe you have lots of heavy usage of IssueMatters vs. ArrangementMatters - you can optimize the indexes and table layout independently.)
A query is also simpler:
Select * from Arrangement a
inner join ArrangementMatters am on am.FK_ArrangementID = a.ID

You can't really do that, at least not in any RDBMS I know of, if you use referential integrity (foreign key relationships). A foreign key always has to reference one and exactly one parent table; you cannot have a FK relationship that references parent table A in one case and parent table B in another.
In your concrete case, where a "Title" could be child of both "Phases" or "Tasks", one way to solve this would be to have a "dummy" Task for those "Title" entries that should be direct children of the "Phases" table.
Anything else will be a hack and a nightmare to maintain in the long run.
Marc

I have seen polymorphic associations of various sorts (not just parent relationships) handled that way in relational databases. I'd say go ahead and use the 'ParentTable, ParentID' approach.
The downsides are that you won't be able to enforce referential integrity at the database level (i.e., use foreign keys), and it's going to be a little more work to fetch associations unless you're using a framework that will do the legwork for you (e.g., Rails). If you really need polymorphic associations, though, I don't know any good way around those complications.

You could have 1 table with ID & ParentID.
The top level row (Arrangement) will have NULL ParentID.

Recommend SQL data model for Semantic Network nodes?

We're building a RDBMS-based web site for a federal semantic network (RDF, Protege, etc). This is basically a large collection of nodes, each having a large and indefinite set of named relationships to (and from) other nodes.
My first thought is a single table for all the nodes (name, description, etc), plus one table per named relationship. Any better ideas out there?

On further reflection, two tables total might do: one for nodes (id, name, description), and another for relations (id, name, description, from, to), where from and to are ids in the nodes table (ints). Still on the right track?

You could optimize the performance by creating 2 rows per relation.
Let's say you have a table Items and a table Relations and that Person A has a relation with Person B. The Relations table has a left and right column, both referring to Items. Now, if you only have one row for this relation, and you want all relations for a certain Item, you would have a query looking like this:
SELECT * FROM Relations WHERE LeftItemId = @ItemId OR RightItemId = @ItemId
The OR in this query will ruin your performance! If you duplicate the row and switch the relation (left becomes right and vice versa), the query looks like this:
SELECT * FROM Relations WHERE LeftItemId = @ItemId
With the right index this one will go blazingly fast.
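A small sketch of the two-rows-per-relation idea, using Python's sqlite3 module. The `Name` column and the `relate` and `relations_of` helpers are assumptions made for the example. The primary key starts with LeftItemId, so the lookup by left item can be served from that index. If the relation is directed (say "parent of"), the mirrored row would also need a flag or an inverse name, which is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Relations (
        LeftItemId  INTEGER NOT NULL,
        RightItemId INTEGER NOT NULL,
        Name        TEXT NOT NULL,
        PRIMARY KEY (LeftItemId, RightItemId, Name)
    );
""")


def relate(a, b, name):
    """Store the relation in both directions inside one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO Relations VALUES (?, ?, ?)",
            [(a, b, name), (b, a, name)],
        )


def relations_of(item_id):
    """All relations of an item, without an OR over two columns."""
    return conn.execute(
        "SELECT RightItemId, Name FROM Relations"
        " WHERE LeftItemId = ? ORDER BY RightItemId",
        (item_id,),
    ).fetchall()


relate(1, 2, "knows")
relate(3, 1, "knows")
print(relations_of(1))
```

The price is twice the rows and the need to keep both copies in sync, which is why the two inserts share a transaction.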

No, that should be fine. Pay attention to the primary key and indexes, so that the performance is good.

If you didn't have a single table for the nodes, you'd have to define a lot of relation tables. Each new node type would require a new relation table with every old node type. That could get out of hand quickly.
So a single table sounds best. You can always use a 1:1 relation to extend it, if you need additional fields for certain node types.

If you're using SQL Server 2008, you might want to consider the new HierarchyID datatype to store your hierarchy in. It's optimized for storage.

Optimized SQL for tree structures

How would you get tree-structured data from a database with the best performance? For example, say you have a folder hierarchy in a database, where each folder row has ID, Name and ParentID columns.
Would you use a special algorithm to get all the data at once, minimizing the number of database calls, and process it in code?
Or would you make many calls to the database and get the structure more or less directly from the database?
Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?
Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.

It really depends on how you are going to access the tree.
One clever technique is to give every node a string id, where the parent's id is a predictable substring of the child. For example, the parent could be '01', and the children would be '0100', '0101', '0102', etc. This way you can select an entire subtree from the database at once with:
SELECT * FROM treedata WHERE id LIKE '0101%';
Because the criterion is an initial substring, an index on the ID column would speed the query.
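A quick runnable sketch of this scheme, using Python's sqlite3 module. The sample rows and the `child_id` helper are invented for the example, and the fixed two digits per level is an assumption taken from the ids shown above; it caps each node at 100 children.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE treedata (id TEXT PRIMARY KEY, name TEXT);
    INSERT INTO treedata VALUES
        ('01',     'root'),
        ('0100',   'docs'),
        ('0101',   'src'),
        ('010100', 'src/main'),
        ('010101', 'src/test'),
        ('0102',   'build');
""")


def subtree(node_id):
    """Return the node and everything below it with one prefix match."""
    # Assumes ids contain only digits, so no % or _ needs escaping.
    return [r[0] for r in conn.execute(
        "SELECT id FROM treedata WHERE id LIKE ? || '%' ORDER BY id",
        (node_id,))]


def child_id(parent_id, ordinal):
    """Build a child id using two digits per level."""
    if not 0 <= ordinal <= 99:
        raise ValueError("only 100 children fit in two digits")
    return f"{parent_id}{ordinal:02d}"


print(subtree("0101"))
print(child_id("0101", 2))
```

The depth of a node falls out of the id length, and the parent id is the child id minus its last two characters. Moving a subtree means rewriting the ids of every node in it.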

Of all the ways to store a tree in an RDBMS, the most common are adjacency lists and nested sets. Nested sets are optimized for reads and can retrieve an entire tree in a single query. Adjacency lists are optimized for writes and can be added to with a simple query.
With adjacency lists each node has a column that refers to the parent node or the child node (other links are possible). Using that you can build the hierarchy based on parent-child relationships. Unfortunately, unless you restrict your tree's depth, you cannot pull the whole thing in one query, and reading it is usually slower than updating it.
With the nested set model the inverse is true, reading is fast and easy but updates get complex because you must maintain the numbering system. The nested set model encodes both parentage and sort order by enumerating all of the nodes using a preorder based numbering system.
I've used the nested set model, and while it is complex, for read-optimizing a large hierarchy it is worth it. Once you do a few exercises in drawing out the tree and numbering the nodes you should get the hang of it.
My research on this method started at this article: Managing Hierarchical Data in MySQL.
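If drawing the tree and numbering it by hand gets tedious, the numbering itself is easy to sketch in code. The following Python function is a hypothetical helper, not taken from the article: it takes an adjacency list (node mapped to parent), assumes exactly one root, and assigns the left value on the way down and the right value on the way back up.

```python
from collections import defaultdict


def number_nested_set(parent_of):
    """Map each node to (left, right) using a depth-first walk.

    parent_of maps node -> parent, with None for the single root.
    """
    children = defaultdict(list)
    root = None
    for node, parent in parent_of.items():
        if parent is None:
            root = node
        else:
            children[parent].append(node)

    bounds = {}
    counter = 0

    def walk(node):
        nonlocal counter
        counter += 1
        left = counter                  # assigned on the way down (preorder)
        for child in children[node]:
            walk(child)
        counter += 1
        bounds[node] = (left, counter)  # right is assigned on the way back up

    walk(root)
    return bounds


# Hypothetical category tree, listed in the sibling order we want to keep.
tree = {
    "electronics": None,
    "televisions": "electronics",
    "lcd": "televisions",
    "plasma": "televisions",
    "audio": "electronics",
}
bounds = number_nested_set(tree)
print(bounds)
```

A leaf always ends up with right = left + 1, and a node's descendant count is (right - left - 1) / 2, which are handy sanity checks when maintaining the numbers in SQL.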

In the product I work on we have some tree structures stored in SQL Server and use the technique mentioned above to store a node's hierarchy in the record. i.e.
tblTreeNode
TreeID = 1
TreeNodeID = 100
ParentTreeNodeID = 99
Hierarchy = ".33.59.99.100."
[...] (actual data payload for node)
Maintaining the hierarchy is the tricky bit, of course, and makes use of triggers. But generating it on an insert/delete/move is never recursive, because the parent or child's hierarchy has all the information you need.
You can get all of a node's descendants thusly:
SELECT * FROM tblTreeNode WHERE Hierarchy LIKE '%.100.%'
Here's the insert trigger:
--Setup the top level if there is any
UPDATE T
SET T.TreeNodeHierarchy = '.' + CONVERT(nvarchar(10), T.TreeNodeID) + '.'
FROM tblTreeNode AS T
INNER JOIN inserted i ON T.TreeNodeID = i.TreeNodeID
WHERE (i.ParentTreeNodeID IS NULL) AND (i.TreeNodeHierarchy IS NULL)
WHILE EXISTS (SELECT * FROM tblTreeNode WHERE TreeNodeHierarchy IS NULL)
BEGIN
    --Update those items that we have enough information to update - parent has text in Hierarchy
    UPDATE CHILD
    SET CHILD.TreeNodeHierarchy = PARENT.TreeNodeHierarchy + CONVERT(nvarchar(10),CHILD.TreeNodeID) + '.'
    FROM tblTreeNode AS CHILD
    INNER JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
    WHERE (CHILD.TreeNodeHierarchy IS NULL) AND (PARENT.TreeNodeHierarchy IS NOT NULL)
END
and here's the update trigger:
--Only want to do something if Parent IDs were changed
IF UPDATE(ParentTreeNodeID)
BEGIN
    --Update the changed items to reflect their new parents
    UPDATE CHILD
    SET CHILD.TreeNodeHierarchy = CASE WHEN PARENT.TreeNodeID IS NULL THEN '.' + CONVERT(nvarchar,CHILD.TreeNodeID) + '.' ELSE PARENT.TreeNodeHierarchy + CONVERT(nvarchar, CHILD.TreeNodeID) + '.' END
    FROM tblTreeNode AS CHILD
    INNER JOIN inserted AS I ON CHILD.TreeNodeID = I.TreeNodeID
    LEFT JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
    --Now update any sub items of the changed rows if any exist
    IF EXISTS (
        SELECT *
        FROM tblTreeNode
        INNER JOIN deleted ON tblTreeNode.ParentTreeNodeID = deleted.TreeNodeID
    )
        UPDATE CHILD
        SET CHILD.TreeNodeHierarchy = NEWPARENT.TreeNodeHierarchy + RIGHT(CHILD.TreeNodeHierarchy, LEN(CHILD.TreeNodeHierarchy) - LEN(OLDPARENT.TreeNodeHierarchy))
        FROM tblTreeNode AS CHILD
        INNER JOIN deleted AS OLDPARENT ON CHILD.TreeNodeHierarchy LIKE (OLDPARENT.TreeNodeHierarchy + '%')
        INNER JOIN tblTreeNode AS NEWPARENT ON OLDPARENT.TreeNodeID = NEWPARENT.TreeNodeID
END
One more bit, a check constraint to prevent a circular reference in tree nodes:
ALTER TABLE [dbo].[tblTreeNode] WITH NOCHECK ADD CONSTRAINT [CK_tblTreeNode_TreeNodeHierarchy] CHECK
((charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy],(charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy]) + 1)) = 0))
I would also recommend triggers to prevent more than one root node (null parent) per tree, and to keep related nodes from belonging to different TreeIDs (but those are a little more trivial than the above.)
You'll want to check for your particular case to see if this solution performs acceptably. Hope this helps!

Celko wrote about this (2000):
http://www.dbmsmag.com/9603d06.html
http://www.intelligententerprise.com/001020/celko1_1.jhtml;jsessionid=3DFR02341QLDEQSNDLRSKHSCJUNN2JVN?_requestid=32818
and other people asked:
Joining other tables in oracle tree queries
How to calculate the sum of values in a tree using SQL
How to store directory / hierarchy / tree structure in the database?
Performance of recursive stored procedures in MYSQL to get hierarchical data
What is the most efficient/elegant way to parse a flat table into a tree?
Finally, you could look at the Rails "acts_as_tree" (read-heavy) and "acts_as_nested_set" (write-heavy) plugins. I don't have a good link comparing them.

There are several common kinds of queries against a hierarchy. Most other kinds of queries are variations on these.
1. From a parent, find all children.
   a. To a specific depth. For example, given my immediate parent, all children to a depth of 1 will be my siblings.
   b. To the bottom of the tree.
2. From a child, find all parents.
   a. To a specific depth. For example, my immediate parent is parents to a depth of 1.
   b. To an unlimited depth.
The (a) cases (a specific depth) are easier in SQL. The special case (depth=1) is trivial in SQL. The non-zero depth is harder. A finite, but non-zero depth, can be done via a finite number of joins. The (b) cases, with indefinite depth (to the top, to the bottom), are really hard.
If your tree is HUGE (millions of nodes) then you're in a world of hurt no matter what you try to do.
If your tree is under a million nodes, just fetch it all into memory and work on it there. Life is much simpler in an OO world. Simply fetch the rows and build the tree as the rows are returned.
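Building the tree as the rows come back can be as small as this Python sketch. The row layout (ID, Name, ParentID) follows the folder example from the question, the table name `folders` in the comment is made up, and `build_tree` and `render` are hypothetical helpers. Rows can arrive in any order because the children are only indexed by parent ID in the first pass.

```python
from collections import defaultdict


def build_tree(rows):
    """Group (id, name, parent_id) rows into a children index in one pass."""
    names = {}
    children = defaultdict(list)  # parent_id -> [child ids]; None holds the roots
    for node_id, name, parent_id in rows:
        names[node_id] = name
        children[parent_id].append(node_id)
    return names, children


def render(names, children, parent_id=None, depth=0):
    """Return the tree as indented lines, depth-first."""
    lines = []
    for node_id in children.get(parent_id, []):
        lines.append("  " * depth + names[node_id])
        lines.extend(render(names, children, node_id, depth + 1))
    return lines


# Rows as they might come back from "SELECT ID, Name, ParentID FROM folders".
rows = [(3, "src", 1), (1, "root", None), (2, "docs", 1), (4, "main", 3)]
names, children = build_tree(rows)
lines = render(names, children)
print("\n".join(lines))
```

That is one query and one pass over the result set, with all the recursion happening in memory instead of in the database.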
If you have a huge tree, you have two choices.
1. Recursive cursors to handle the unlimited fetching. This means the maintenance of the structure is O(1) -- just update a few nodes and you're done. However fetching is O(n*log(n)) because you have to open a cursor for each node with children.
2. Clever "heap numbering" algorithms can encode the parentage of each node. Once each node is properly numbered, a trivial SQL SELECT can be used for all four types of queries. Changes to the tree structure, however, require renumbering the nodes, making the cost of a change fairly high compared to the cost of retrieval.

If you have many trees in the database, and you will only ever get the whole tree out, I would store a tree ID (or root node ID) and a parent node ID for each node in the database, get all the nodes for a particular tree ID, and process in memory.
However, if you will be getting subtrees out, a query on a particular parent node ID only gives you the nodes directly under that parent. So you either need to store all parent nodes of each node in order to use the above method, or perform multiple SQL queries as you descend into the tree (hope there are no cycles in your tree!). You can reuse the same prepared statement for those queries (assuming that nodes are of the same type and are all stored in a single table) to avoid re-compiling the SQL, so it might not be slower; indeed, with database optimisations applied to the query it could be preferable. You might want to run some tests to find out.
If you are only storing one tree, your question becomes one of querying subtrees only, and the second approach applies.

Google for "Materialized Path" or "Genetic Trees"...

In Oracle there is the SELECT ... CONNECT BY statement to retrieve trees.

I am a fan of the simple method of storing an ID associated with its parentID:
ID    ParentID
1     null
2     null
3     1
4     2
...   ...
It is easy to maintain, and very scalable.

This article is interesting as it shows some retrieval methods as well as a way to store the lineage as a derived column. The lineage provides a shortcut method to retrieve the hierarchy without too many joins.

Not going to work for all situations, but for example given a comment structure:
ID | ParentCommentID
You could also store TopCommentID which represents the top most comment:
ID | ParentCommentID | TopCommentID
Where the TopCommentID and ParentCommentID are null or 0 when it's the topmost comment. For child comments, ParentCommentID points to the comment above it, and TopCommentID points to the topmost parent.
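A small sketch of how that could work in practice, using Python's sqlite3 module. The table name `comments` and the `add_comment` and `thread` helpers are assumptions made for the example, and NULL (rather than 0) marks a topmost comment here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (
        ID              INTEGER PRIMARY KEY,
        ParentCommentID INTEGER,
        TopCommentID    INTEGER
    );
    CREATE INDEX ix_comments_top ON comments (TopCommentID);
""")


def add_comment(comment_id, parent_id=None):
    """Insert a comment, inheriting TopCommentID from its parent."""
    top_id = None
    if parent_id is not None:
        (parent_top,) = conn.execute(
            "SELECT TopCommentID FROM comments WHERE ID = ?",
            (parent_id,)).fetchone()
        # A parent with no TopCommentID is itself the topmost comment.
        top_id = parent_top if parent_top is not None else parent_id
    conn.execute("INSERT INTO comments VALUES (?, ?, ?)",
                 (comment_id, parent_id, top_id))


def thread(top_id):
    """Fetch a topmost comment and every reply beneath it in one query."""
    return [r[0] for r in conn.execute(
        "SELECT ID FROM comments WHERE ID = ? OR TopCommentID = ? ORDER BY ID",
        (top_id, top_id))]


add_comment(1)
add_comment(2, parent_id=1)
add_comment(3, parent_id=2)
add_comment(4)
add_comment(5, parent_id=4)
print(thread(1), thread(4))
```

This fetches a whole thread in one round trip, and the nesting can then be rebuilt in memory from ParentCommentID. It does not help when only part of a thread is wanted.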
