Store tree in SQL with fast transitive mean - sql

So, my problem is that I need to store a tree structure in an SQL database.
There are 2 types of nodes: NODES and LEAVES. Nodes store no data. Leaves store a single number.
Sometimes new nodes and leaves are inserted (it is okay to insert them in the middle of a hierarchy), sometimes they are deleted (both leaves and nodes), and sometimes they are updated (for example, a node switches to another parent, or a leaf gets new data).
My primary goal is to be able to tell, for each node, the mean value of its leaves, transitive ones included (i.e. leaves of child nodes, leaves of children of children, etc.).
So far I have come up with several ideas, but none of them seems efficient and maintainable when it comes to inserting and deleting data:
Using PostgreSQL's ltree module. It allows a quick check of whether a leaf belongs to a node, including transitively. With that, we can select the leaves that are descendants of the current node and calculate their mean value. However, it seems to me there may be issues with updating. For example, when a node switches its parent, we need to update every child node and leaf below it (including transitive ones), so that they no longer match when we search, for example, for the leaves of an ancestor of the node's previous parent. (I hope that doesn't sound too complicated.)
The second approach I've considered is using an array in each node and leaf to store all of its ancestry (parent, grandparent, great-grandparent, etc.). This shares the same problem with switching parents. Moreover, I have doubts about the performance of searching in arrays, since the hierarchy may become really deep.
Basic approaches like storing a direct parent link or an array of children do not seem performant enough to me, since I would have to execute n-1 queries, where n is the depth of the tree.
So now I'm wondering whether there are any other ideas I have not thought of. I believe there must be a better approach!
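For comparison, here is a minimal sketch of one more option: keep only a direct parent link, but walk the subtree with a single recursive CTE instead of issuing one query per level. It uses Python's sqlite3 module so that it is self-contained. The table name `node`, its columns, and the helper `subtree_mean` are made up for illustration, and whether this is fast enough for a large, deep tree is something to measure rather than assume.

```python
import sqlite3

def subtree_mean(conn, node_id):
    """Return the mean of all leaf values below node_id (transitively), or None."""
    row = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT ?                      -- start at the requested node
            UNION ALL
            SELECT n.id                   -- then add every child of a visited node
            FROM node n JOIN subtree s ON n.parent_id = s.id
        )
        SELECT AVG(n.value)               -- inner nodes have value NULL and are ignored
        FROM node n JOIN subtree s ON n.id = s.id
        """,
        (node_id,),
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node(id),"
    " value REAL)"  # hypothetical schema: value is NULL for inner nodes
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, None, None),  # root node
        (2, 1, None),     # inner node under the root
        (3, 1, 10.0),     # leaf under the root
        (4, 2, 20.0),     # leaves under node 2
        (5, 2, 30.0),
    ],
)

mean_root = subtree_mean(conn, 1)
mean_inner = subtree_mean(conn, 2)

# Switching a parent touches exactly one row; no descendants need rewriting.
conn.execute("UPDATE node SET parent_id = NULL WHERE id = 2")
mean_root_after_move = subtree_mean(conn, 1)
```

The trade-off compared with ltree or ancestry arrays is that reads do more work (the recursion runs on every query), while inserts, deletes, and parent switches stay single-row changes.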

Related

SCIP: Children vs Parent vs Siblings

I am implementing a node selector. I assumed that SCIPgetLeaves would give me the list of current nodes from which one needs to be selected for further branching. However, after the presolving stage, SCIPgetLeaves in NODESELSELECT() doesn't return any nodes. Instead, I had to use SCIPgetFocusNode().
The documentation states that the NODESELSELECT() chooses one of leaves, children and siblings, so I tried collecting all three. Is there a reason why children and siblings of the root node are not included in leaves after the presolving stage? Just trying to understand the design of SCIP.
All three node types relate to the focus node:
SCIPgetChildren() offers quick access to the children newly created via branching
SCIPgetSiblings() offers access to the sibling(s) of the focus node
SCIPgetLeaves() is the rest with more distant relationships to the focus node
Just keep in mind that with every selection, the open nodes are partitioned into the 3 types above.
The node solution process benefits greatly from the possibility of warm/hot-starting the dual simplex algorithm, which is why SCIP (like other solvers) mostly performs diving (also called "plunging") down the tree, with some limits. This requires quick access to the children of the focus node.
Have a look at src/scip/nodesel_dfs.c for a good example of a simple node selection.

Meaning of values in SCIP_NodeType

Is there a combination of parameter settings so that the search tree only contains "simple" node types, i.e. not SCIP_NODETYPE_{PROBINGNODE, DEADEND, JUNCTION, PSEUDOFORK, FORK, SUBROOT, REFOCUSNODE}? Even if it means disabling some functionality.
I'm also not really sure about what the different node types really mean, so any pointers to documentation would also be very useful.
The meaning of the different node types is explained here. If you have a closer look, you will understand that those types reflect the internal organization of the tree. They necessarily occur during tree search and cannot be skipped via parameters.
If you insist: setting limits/nodes = 1 will process only the root node of SCIP, and the tree will only consist of a focus node and its 2 children.

What's the best database for my data structure?

I have two data structures that I need to store in a database. At this point, I'm relatively sure that SQL and relational databases in general wouldn't work, but I'm also not sure what alternatives I have or which of those alternatives would be best. If there is a reasonable way to implement these structures in MySQL or something similar, I'm open to the idea.
Structure 1:
A nested tree diagram, where nodes are not defined ahead of time, and are instead generated from the data. I have a lot of strings that I need to separate into trees such that each branch node on the tree is empty and each leaf node contains a maximum of 200 strings, all beginning with the same prefix. I would use SQL, but considering I will regularly have upwards of 9.45x10^55 nodes (branch and leaf), I can't use the tree traversal method; adding a single node would take too much time.
Structure 2:
I have an array of the leaf nodes from the above structure, however, every leaf node has its own data associated with it, yet not contained within it.
From my (extremely limited) understanding of SQL, the second structure can be implemented in MySQL or something similar. The problem is that I need to be able to retrieve individual nodes from the second structure instead of the entire array of nodes. Also, I don't know the length of the array ahead of time, so I can't simply make a table with a certain number of columns available for each node: I'd end up having over 9.09x10^55 columns, when I will regularly be using only 5 or fewer.
If you have any recommendations as to what kind of database I could use to implement these structures relatively easily, or any advice pertaining to the implementation itself, it would be greatly appreciated.

Why does no one store an array of child ids on nodes in a relational db tree structure?

I'm working on a hobby project which is a nested todo list app.
I started with the tree view example in the Redux repository (a JavaScript project), which uses a data structure that stores an array of child ids on each node. At first I found this strange, but having played with the structure for some time, IMO it's easy to maintain and reason about.
Now I'm researching the best method to persist my todos using PostgreSQL. I've extensively read through the pros and cons of each solution here (What are the options for storing hierarchical data in a relational database?), and am about to settle on recursive CTEs, as writes and updates take priority over reads in my app, but I thought I should ask: is there a reason why storing child ids on the parent is not a popular solution for relational DBs?
It's most similar to the Adjacency List method, but with less recursion, and you get ordering (by maintaining order in your child_ids array). It also seems easier to reason about as you don't need to think inversely about parents while asking for a 'top down' tree.
I already have the logic in place to interactively move/update/etc nodes in Javascript using this structure, so it would be a big win if I could use the same logic to persist the data.
is there a reason why storing child ids on the parent is not a popular solution for relational DB's
This idea is very popular, with two caveats, and under a certain modeling scenario:
Since a database field stores a single "thing", storing a list "on the parent" means storing the list in a separate table related to the parent by ID, and
Since a single child can belong to multiple parents, enforcing "one parent per child" policy becomes a lot harder.
In effect, the DB's model of storing child IDs on the parent side stores IDs on nobody's side, because you can retrieve the IDs of the parents of a child with the same ease as the children of a parent. That is why this approach is used when you model a many-to-many relationship.
Note: Since enforcing one parent per child policy when you store parent ID on the child happens automatically, you can easily model your tree like that, while converting in-memory representation to an adjacency list for the tree.
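The conversion mentioned in the note can be sketched roughly as follows. This assumes the in-memory tree is a dict mapping each node id to an object with a `child_ids` list, similar in spirit to the Redux example. The function name `to_adjacency_rows` and the `position` column (used to keep the sibling order) are hypothetical, not part of any library.

```python
def to_adjacency_rows(nodes, root_id):
    """Flatten {id: {"child_ids": [...]}} into (id, parent_id, position) rows.

    Each row stores the parent on the child, so "one parent per child" holds
    by construction; position preserves the order of the child_ids array.
    """
    rows = [(root_id, None, 0)]
    seen = {root_id}
    stack = [root_id]
    while stack:
        parent = stack.pop()
        for position, child in enumerate(nodes[parent]["child_ids"]):
            if child in seen:
                # The in-memory structure is not a tree if this happens.
                raise ValueError(f"node {child} has more than one parent")
            seen.add(child)
            rows.append((child, parent, position))
            stack.append(child)
    return rows

# Hypothetical in-memory state: node 0 has children 1 and 2, node 1 has child 3.
state = {
    0: {"child_ids": [1, 2]},
    1: {"child_ids": [3]},
    2: {"child_ids": []},
    3: {"child_ids": []},
}
rows = to_adjacency_rows(state, 0)
```

The resulting rows can be written to a table with columns such as (id, parent_id, position), and ordering by position when reading children back restores the original array order.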

A tree, where each node could have multiple parents

Here's a theoretical/pedantic question: imagine properties where each one could be owned by multiple others. Furthermore, from one iteration of ownership to the next, two neighboring owners could decide to partly combine ownership. For example:
territory 1, t=0: a,b,c,d
territory 2, t=0: e,f,g,h
territory 1, t=1: a,b,g,h
territory 2, t=1: g,h
That is to say, c and d no longer own property; and g and h became fat cats, so to speak.
I'm currently representing this data structure as a tree where each child could have multiple parents. My goal is to cram this into the Composite design pattern; but I'm having issues getting a conceptual footing on how the client might go back and update previous ownership without mucking up the whole structure.
My question is twofold.
Easy: What is a convenient name for this data structure such that I can google it myself?
Hard: What am I doing wrong? When I code I try to keep the mantra, "Keep it simple, Stupid," in my head, and I feel I am breaking this credo.
My question is two fold: Easy: What is a convenient name for this data structure such that I can google it myself?
What you have here is not a tree, it is a graph. A multimap will help you here. But any adjacency list or adjacency matrix will give you a good start.
Here is a video on adjacency matrix and list: YouTube on adjacency matrix and list
Hard: What am I doing wrong?
This is really hard to tell. Perhaps you did not model the relationship in a proper way. It is not that hard, given a good data structure to start with. And, as you asked for design patterns (but you probably found out yourself), the Composite pattern will let you model such a setting with ease.
You have a many-to-many relationship between your owners and your territories (properties). I'm not sure what language you're working in, but this sort of thing can be easily represented and tracked in a relational database. (You'd probably want a table for each entity, and the relationship would probably require a third "junction" table. If it's necessary to be able to query "back in time", this could have some sort of "time index" column as well.)
If you are working in an object-oriented language, you might create two classes, Territory and Owner, where the Territory class has a property/member/field which is a collection of references/pointers to Owners and the Owner class has a similar collection of Territories. (One of these two collections may need to contain "weak" references depending on the language.)
In this case, some difficulty may arise if you want to be able to go back and look at the network state at some particular point earlier in time. (If this is what you need, say so and I (or someone else) can post a solution that works for that.)
I'm not sure what level of simplicity you are striving for, but in neither of these cases is updating the ownership relationships really that "hard". Maybe if you posted some code it might be easier to give you more concrete advice.
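As a rough illustration of the relational variant described in this answer, here is a sketch with one table per entity plus a junction table carrying a time index, loaded with the ownership data from the question. It uses Python's sqlite3 module. All table, column, and function names (`owner`, `territory`, `ownership`, `t`, `owners_of`, `territories_of`) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE territory (id INTEGER PRIMARY KEY);
    -- Junction table: one row per (territory, owner) pair per time step.
    CREATE TABLE ownership (
        territory_id INTEGER NOT NULL REFERENCES territory(id),
        owner_id     INTEGER NOT NULL REFERENCES owner(id),
        t            INTEGER NOT NULL,
        PRIMARY KEY (territory_id, owner_id, t)
    );
""")
conn.executemany("INSERT INTO owner (name) VALUES (?)", [(c,) for c in "abcdefgh"])
conn.executemany("INSERT INTO territory (id) VALUES (?)", [(1,), (2,)])

# The four ownership snapshots from the question, keyed by (territory, t).
snapshots = {(1, 0): "abcd", (2, 0): "efgh", (1, 1): "abgh", (2, 1): "gh"}
for (territory_id, t), names in snapshots.items():
    conn.executemany(
        "INSERT INTO ownership SELECT ?, id, ? FROM owner WHERE name = ?",
        [(territory_id, t, name) for name in names],
    )

def owners_of(territory_id, t):
    """Names of the owners of a territory at time t, sorted alphabetically."""
    rows = conn.execute(
        "SELECT o.name FROM ownership s JOIN owner o ON o.id = s.owner_id "
        "WHERE s.territory_id = ? AND s.t = ? ORDER BY o.name",
        (territory_id, t),
    )
    return [name for (name,) in rows]

def territories_of(name, t):
    """Ids of the territories owned by the named owner at time t."""
    rows = conn.execute(
        "SELECT s.territory_id FROM ownership s JOIN owner o ON o.id = s.owner_id "
        "WHERE o.name = ? AND s.t = ? ORDER BY s.territory_id",
        (name, t),
    )
    return [tid for (tid,) in rows]
```

Because each time step has its own rows, correcting an earlier snapshot means inserting or deleting rows for that value of `t` only, and later snapshots are left alone.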
Hard to tell without more information regarding the business rules, though I have plenty of experience designing graphs where each node could potentially have numerous parents.
A common structure is the Directed Acyclic Graph. The essential rule here is that no path through the graph can cycle back onto itself. For example, the path "A/B/C/B" would not be valid, as B appears twice.
Valid:- "A/B/C", "D/E/C", node C has two parents E and B.
Invalid:- "A/B/C/B", node B repeats in the same path causing a cycle.
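The acyclicity rule can be checked in code along these lines. This is only a sketch: `is_valid_path` tests a single slash-separated path as in the examples above, and `would_create_cycle` tests whether adding a new parent link would let a node reach itself. Both function names and the dict-of-sets representation of parent links are assumptions for illustration.

```python
def is_valid_path(path):
    """A path is valid if no node name appears in it more than once."""
    nodes = path.split("/")
    return len(nodes) == len(set(nodes))

def would_create_cycle(parents, child, new_parent):
    """True if making new_parent a parent of child would create a cycle.

    parents maps each node to the set of its direct parents. A cycle appears
    exactly when child is new_parent itself or one of its ancestors.
    """
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False

# The valid example above: paths A/B/C and D/E/C, so C has parents B and E.
parents = {"B": {"A"}, "E": {"D"}, "C": {"B", "E"}}
```

Running the second check before every insert of a parent link is one way to keep the graph acyclic as it is edited.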