Hierarchy representation for finding leaves - SQL

I have a number of hierarchies (trees) I am trying to represent in a relational database. The most important operation I will perform is: given a root node, find the leaves of the tree. Which hierarchy representation is best suited for this operation?
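For reference, here is a minimal sketch of one common answer: an adjacency-list table (each row stores its parent) queried with a recursive CTE that collects the subtree under a given root and keeps only the rows that are nobody's parent. The table and column names (nodes, id, parent_id) and the $root_id placeholder are illustrative assumptions, and the syntax assumes an engine with WITH RECURSIVE support (PostgreSQL, SQLite, MySQL 8+).

-- Adjacency-list representation: each node stores a reference to its parent.
CREATE TABLE nodes (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES nodes(id)  -- NULL for a root
);

-- Leaves under a given root: walk down from $root_id, then keep only
-- the rows that never appear as anyone's parent.
WITH RECURSIVE subtree AS (
    SELECT id FROM nodes WHERE id = $root_id
    UNION ALL
    SELECT n.id FROM nodes n JOIN subtree s ON n.parent_id = s.id
)
SELECT id FROM subtree
WHERE id NOT IN (SELECT parent_id FROM nodes WHERE parent_id IS NOT NULL);

A closure table or nested-set encoding can answer the same question without recursion (in nested sets a leaf is simply a row where rgt = lft + 1), at the cost of more expensive updates.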

Related

What are some practical applications of taking the vertical sum of a binary tree

I came across this interview question and was wondering why you would want to take the vertical sum of a binary tree. Is this algorithm useful?
http://www.careercup.com/question?id=2375661
For a balanced tree the vertical sum could give you a rough insight into the range of the data. Binary trees, although easier to code, can take on more pathological shapes depending on the order in which the data is inserted. The vertical sum would be a good indicator of that pathology.
Look at the code at vertical sum in binary tree. This algorithm is written assuming a max width for the tree. Using this algorithm you will be able to get a feel for different types of unbalanced trees.
An interesting variation of this program would be to use permutations of a fixed data set to build binary trees and look at the various vertical sums. The leading and trailing zeroes give you a feel for how the tree is balanced, and the vertical sums can give you insight into how data arrival order can affect the height of the tree (And the average access time for the data in the tree). An internet search will return an implementation of this algorithm using dynamic data structures. With these I think you would want to document which sum included the root node.
Your question "Is this algorithm useful?" really raises the question of how useful an unbalanced binary tree is compared to a balanced one. The vertical sum of a tree documents whether the implementation is closer to O(N) or O(log N). Look up an article on balanced binary trees, put a balanced-tree implementation in your personal toolkit, and try to remember whether you would use a pre-order, in-order, or post-order traversal of the tree to calculate your vertical sum. You'll get an A+ for this question.
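Tying this back to the relational theme of the original question, here is a hedged sketch of computing vertical sums over a binary tree stored in a table: a recursive CTE assigns each node a horizontal distance (root 0, left child -1, right child +1) and the sums are grouped on it. The binary_tree table and its columns are assumptions for illustration, not taken from the code referred to above.

-- Assumed schema: each node records its parent, whether it is the
-- left or right child, and the value to be summed.
CREATE TABLE binary_tree (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES binary_tree(id),  -- NULL for the root
    is_left   BOOLEAN,                             -- NULL for the root
    value     INTEGER NOT NULL
);

-- Assign horizontal distances, then sum each vertical column.
WITH RECURSIVE positioned AS (
    SELECT id, value, 0 AS hd FROM binary_tree WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.value,
           p.hd + CASE WHEN c.is_left THEN -1 ELSE 1 END
    FROM binary_tree c JOIN positioned p ON c.parent_id = p.id
)
SELECT hd, SUM(value) AS vertical_sum
FROM positioned
GROUP BY hd
ORDER BY hd;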

How does one come to the conclusion that the data is "Unrelational"

What sort of pointers, facets, characteristics or structures should one analyse to determine whether a given collection of objects fits best in a relational store or a non-relational one? And further, how does one determine whether to use a document store rather than a key-value (KVP) store?

Efficient Database Structure for Deep Tree Data

For a very large database (more than a billion rows) containing a very deep data tree, what is the most efficient structure? Reads are the dominant load, but there are also regular changes to the tree.
There are several standard algorithms to represent a data tree. I have found this reference as part of the Mongodb manual to be an excellent summary: http://docs.mongodb.org/manual/tutorial/model-tree-structures/
My system has properties that do not map well to any of these cases. The issue is that the depth of the tree is so great that keeping "ancestors" or a "path" is very large. The tree also changes frequently enough that the "Nested Sets" approach is not efficient. I am considering a hybrid of the "Materialized Paths" and "Parent References" approaches, where instead of the path I store a hash that is not guaranteed to be unique, but 90% of the time it is. The 10% of the time there is a collision, the parent reference resolves it. The idea is that 90% of the time there is a fast query on the path hash, similar in spirit to a Bloom filter. But this is all background: the question is in the first line of this post.
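As a rough illustration of the hybrid being described (my reading of it, with purely illustrative names), the schema might look like this: an indexed, non-unique hash of the materialized path serves the fast 90% case, and the parent reference is kept for collision resolution.

-- Hybrid of "Materialized Paths" and "Parent References" (illustrative names).
CREATE TABLE tree_nodes (
    id        BIGINT PRIMARY KEY,
    parent_id BIGINT REFERENCES tree_nodes(id),  -- used to resolve hash collisions
    path_hash BIGINT NOT NULL                    -- hash of the full path; not unique
);
CREATE INDEX tree_nodes_path_hash_idx ON tree_nodes (path_hash);

-- Fast case: a direct index lookup on the hash.
SELECT id, parent_id FROM tree_nodes WHERE path_hash = $hash;

When the lookup returns more than one row, the collision is resolved by walking parent_id upward from each candidate until their ancestries diverge.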
What I've done in the past with arbitrarily deep trees is just to store a parent key with each node, as well as a sequence number that governs the order of children under a parent. I used RDBMSs and this worked very efficiently. Arranging the tree structure after reading required code to put each node into a child collection on the node's parent, but this in fact ran pretty fast.
It's a pretty naive approach, in that there's nothing clever about it, but it works for me.
The tree had about 300 or 400 members total and was I think 7 or 8 levels deep. This part of the system had no performance problems at all: it was very fast. The UI was a different matter, but that's another story.
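A minimal sketch of that parent-key-plus-sequence layout, with illustrative names: the whole tree is read in one query and the application assembles the child collections.

-- One row per node: parent_id gives the topology, seq orders siblings.
CREATE TABLE tree_node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES tree_node(id),  -- NULL for the root
    seq       INTEGER NOT NULL,                  -- position among its siblings
    payload   TEXT
);

-- Read the whole tree in one pass; the application then drops each row
-- into its parent's child collection, already in sibling order.
SELECT id, parent_id, seq, payload
FROM tree_node
ORDER BY parent_id, seq;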

Best practice for storing a neural network in a database

I am developing an application that uses a neural network. Currently I am looking at either putting it into a relational database based on SQL (probably SQL Server) or a graph database.
Performance matters here, because the neural net will be very large.
My questions:
Do relational databases suffer a performance hit when dealing with a neural net in comparison to graph databases?
What graph-database technology would be best suited to dealing with a large neural net?
Can a geospatial database such as PostGIS be used to represent a neural net efficiently?
That depends on what you intend to do with the model as it evolves.
Do you have a fixed idea of an immutable structure for the network, like a Kohonen map or an off-the-shelf model?
Do you have several relationship structures you need to test out, so that you want to be able to flip a switch and alternate between various structures?
Does your model treat the nodes as fluid automatons, free to seek their own neighbours, where each automaton develops unique values for a common set of parameters and you need to analyse how those values affect its "choice" of neighbours?
Do you have a fixed set of parameters for a fixed number of types/classes of nodes? Or is a node expected to develop a unique range of attributes and relationships?
Do you have frequent need to access each node, especially those embedded deep in the network layers, to analyse and correlate them?
Is your network perceivable as, or quantizable into, a set of state machines?
Disclaimer
First of all, I should disclaim that I am familiar only with Kohonen maps. (So I admit to having been derided for Kohonen maps being only the entry level of anything barely neural-network.) The questions above are the product of years of personal mental exploits, fantasizing after random and admittedly shallow reading of various neural schemes.
Category vs Parameter vs Attribute
Can we class vehicles by the number of wheels or by tonnage? Should wheel count or tonnage be attributes, parameters or category characteristics?
Understanding this debate is a crucial step in structuring your repository, and it is especially relevant to disease and patient vectors. I have seen patient-information relational schemata, designed by medical experts but obviously without much training in information science, that presume a common set of parameters for every patient, with thousands of columns, mostly unused, for each patient record. And when they exceed the column limit for a table, they create a new table with yet thousands more sparsely used columns.
Type 1: All nodes have a common set of parameters and hence a node can be modeled into a table with a known number of columns.
Type 2: There are various classes of nodes. There is a fixed number of classes of nodes. Each class has a fixed set of parameters. Therefore, there is a characteristic table for each class of node.
Type 3: There is no intent to pigeon-hole the nodes. Each node is free to develop and acquire its own unique set of attributes.
Type 4: There is a fixed number of classes of nodes. Each node within a class is free to develop and acquire its own unique set of attributes. Each class has a restricted set of attributes a node is allowed to acquire.
Read up on the EAV (entity-attribute-value) model to understand the issue of parameters vs attributes. In an EAV table, a node needs only three characterising columns:
node id
attribute name
attribute value
However, under the constraints of the technology, an attribute value could be a number, a string, an enumerable or a category. Therefore, there would be four more attribute tables, one for each value type, plus the node table:
node id
attribute type
attribute name
attribute value
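As a hedged sketch of that layout, using the same table names as the example query further down (Nodes, NumericAttributes), the numeric variant might look like this; the string, enumerable and category tables would follow the same shape.

-- Core node table.
CREATE TABLE Nodes (
    id BIGINT PRIMARY KEY
);

-- One attribute table per value type; only the numeric one is shown.
CREATE TABLE NumericAttributes (
    id            BIGINT REFERENCES Nodes(id),  -- the node this attribute belongs to
    attributeName VARCHAR(100) NOT NULL,
    value         NUMERIC NOT NULL,
    PRIMARY KEY (id, attributeName)
);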
Sequential/linked access versus hashed/direct-address access
Do you have to access individual nodes directly rather than traversing the structural tree to get to a node quickly?
Do you need to find a list of nodes that have acquired a particular trait (a set of attributes) regardless of where they sit topologically in the network? Do you need to perform classification or dimensionality reduction (e.g. principal component analysis) on the nodes of your network?
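Against the EAV-style tables sketched above, a trait query of that kind is a straightforward GROUP BY with a HAVING clause; the attribute names here are purely hypothetical.

-- Nodes that have acquired both of two traits, wherever they sit in the network.
SELECT id
FROM NumericAttributes
WHERE attributeName IN ('excitability', 'decay_rate')  -- hypothetical trait names
GROUP BY id
HAVING COUNT(DISTINCT attributeName) = 2;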
State-machine
Do you wish to perceive the regions of your network as a collection of state-machines?
State machines are very useful quantization entities. State-machine quantization helps you form empirical entities over a range of nodes, based on neighbourhood similarities and relationships.
Instead of trying to understand and track the individual behaviour of millions of nodes, why not clump them into regions of similarity and track the state-machine flow of those regions?
Conclusion
This is my recommendation: start with a totally relational database. The reason is that a relational database and its associated SQL give you a very liberal view of relationships. With SQL on a relational model, you can query or correlate relationships that you did not know existed.
As your experiments progress, you may find certain relationship modelling more suitable to a network-graph repository; you can then move those parts of the schema into such a repository.
In the final state of affairs, I would maintain a dual-mode information repository. You keep a relational repo to track nodes and their attributes, and you store the dynamically mutating structure in a network-graph repository, where each graph node refers back to a node id in the relational database. The relational database then allows you to query nodes based on their attributes and values. For example,
SELECT a.id FROM Nodes a, NumericAttributes b
WHERE b.attributeName = $name
AND b.value BETWEEN $rangeLow AND $rangeHigh
AND a.id = b.id
I am also wondering whether Hadoop could be used instead of a traditional network-graph database, but I don't know how well Hadoop adapts to dynamically changing relationships. My understanding is that Hadoop is good for write-once, read-many workloads, so a dynamic neural network with frequent relationship changes may not perform well there. Then again, a relational table modelling network relationships is not especially efficient either.
Still, I believe I have only surfaced questions you need to consider rather than given you a definite answer, especially given my rusty knowledge of many of these concepts.
Trees can be stored in a table by using self-referencing foreign keys. I'm assuming the only two things that need to be stored are topology and the weights; both of these can be stored in a flattened tree structure. Of course, this can require a lot of recursive selects, which depending on your RDBMS may be a pain to implement natively (thus requiring many SQL queries to achieve). I cannot comment on the comparison, but hopefully that helps with the relational point of view :)
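A minimal sketch of that flattened, self-referencing layout, with illustrative names; the topology lives in the foreign key and the weight sits alongside it.

-- Topology and weights stored in one self-referencing table.
CREATE TABLE neuron (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES neuron(id),  -- NULL at the root of the flattened tree
    weight    REAL NOT NULL
);

On engines with recursive CTE support, the "lot of recursive selects" collapses into a single WITH RECURSIVE query like the one in the first sketch near the top of this page; otherwise it degenerates into one query per level, which is the pain alluded to above.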

Where should I begin with this database design?

I have 5 tables all unnormalised and I need to create an ER model, a logical model, normalise the data and also a bunch of queries.
Where would you begin? Would you normalise the data first? Create the ER model and the relationships?
There are two ways to start data modelling: top-down and bottom-up.
The top-down approach is to ask what things (tangible and intangible) are important to your system. These things become your entities. You then go through the process of figuring out how your entities are related to each other (your relationships) and then flesh out the entities with attributes. The result is your conceptual or logical model. This can be represented in ERD form if you like.
Either along the way or after your entities, relationships and attributes are defined, you go through the normalization process and make other implementation decisions to arrive at your physical model - which can also be represented as an ERD.
The bottom-up approach is to take your existing relations - i.e. whatever screens, reports, datastores, or whatever existing data representations you have and then perform a canonical synthesis to reduce the entire set of data representation into a single, coherent, normalized model. This is done by normalizing each of your views of data and looking for commonalities that let you bring items together into a single model.
Which approach you use depends a little bit on personal choice, and quite a bit on whether you have existing data views to start from.
I think you should first prepare the list of entities and attributes, so that you get a complete picture of the data.
Once you are clear on the data, you can start creating the master tables and then normalise.
After the complete database is designed and normalised, you can create the ER diagram very easily.
I would start by evaluating and then preparing the list of entities and attributes within your data.
I would do it in this order:
Identify the relationships.
Create the ER model.
Normalise the data.
I know many others will have a different opinion but this is the way I would go ahead with it :)