Efficient Database Structure for Deep Tree Data - sql

For a very large database (more than a billion rows) that holds a very deep tree, what is the most efficient structure? Reads make up most of the load, but the tree also changes on a regular basis.
There are several standard algorithms to represent a data tree. I have found this reference, part of the MongoDB manual, to be an excellent summary: http://docs.mongodb.org/manual/tutorial/model-tree-structures/
My system has properties that do not map well to any of these cases. The tree is so deep that storing "ancestors" or a "path" for each node takes a very large amount of space. The tree also changes frequently enough that the "Nested Sets" approach is not efficient. I am considering a hybrid of the "Materialized Paths" and "Parent References" approaches, where instead of the path I store a hash of the path that is not guaranteed to be unique, but is unique about 90% of the time. In the 10% of cases where there is a collision, the parent reference resolves it. The idea is that 90% of the time a fast query on the path hash is enough. This is somewhat like a Bloom filter technique. But this is all background: the question is in the first line of this post.
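A minimal sketch of the hybrid described above, assuming an in-memory store in place of a real database. The names `TreeStore`, `path_hash`, and `HASH_BITS` are hypothetical, and the hash is deliberately short so that collisions can happen. Lookups use the hash bucket first and fall back to walking parent references only when the bucket holds more than one node (or when the caller asks for verification).

```python
import hashlib
from collections import defaultdict

HASH_BITS = 16  # assumption: deliberately short, so collisions are possible


def path_hash(names):
    """Short, non-unique hash of a root-to-node path given as a list of names."""
    digest = hashlib.sha1("/".join(names).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << HASH_BITS)


class TreeStore:
    def __init__(self):
        self.nodes = {}                   # node_id -> (parent_id, name, path_hash)
        self.by_hash = defaultdict(list)  # path_hash -> [node_id, ...]
        self._next_id = 0

    def names_to_root(self, node_id):
        """Rebuild the full path by following parent references."""
        names = []
        while node_id is not None:
            parent_id, name, _ = self.nodes[node_id]
            names.append(name)
            node_id = parent_id
        return names[::-1]

    def add(self, name, parent_id=None):
        parent_path = self.names_to_root(parent_id) if parent_id is not None else []
        h = path_hash(parent_path + [name])
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = (parent_id, name, h)
        self.by_hash[h].append(node_id)
        return node_id

    def find(self, path, verify=False):
        candidates = self.by_hash.get(path_hash(path), [])
        if len(candidates) == 1 and not verify:
            return candidates[0]   # fast path: no collision in this bucket
        for node_id in candidates:  # slow path: resolve via parent references
            if self.names_to_root(node_id) == path:
                return node_id
        return None
```

Like a Bloom filter, the fast path cannot miss a path that exists, but it can return a false positive for a path that does not; passing `verify=True` trades speed for certainty. Note also that moving a subtree would require recomputing the hash of every descendant in this sketch.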

What I've done in the past with arbitrarily deep trees is just to store a parent key with each node, as well as a sequence number that governs the order of children under a parent. I used an RDBMS and this worked very efficiently. Rebuilding the tree structure after reading the rows required some code (putting each node into a child collection on the node's parent), but this in fact ran pretty fast.
It's a pretty naive approach, in that there's nothing clever about it, but it does work for me.
The tree had about 300 or 400 members in total and was, I think, 7 or 8 levels deep. This part of the system had no performance problems at all: it was very fast. The UI was a different matter, but that's another story.
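As a rough illustration of the approach in this answer, here is a minimal sketch that assembles a tree from flat rows carrying a parent key and a sequence number. The row layout `(node_id, parent_id, seq, label)` and the function name `build_tree` are assumptions made for the example, not the original code.

```python
def build_tree(rows):
    """Assemble nested nodes from flat (node_id, parent_id, seq, label) rows.

    parent_id is None for root nodes; seq orders the children of one parent.
    """
    rows = list(rows)
    nodes = {
        node_id: {"id": node_id, "label": label, "children": []}
        for node_id, _parent_id, _seq, label in rows
    }
    roots = []
    # Visiting rows in seq order means each child list ends up sorted by seq.
    for node_id, parent_id, _seq, _label in sorted(rows, key=lambda r: r[2]):
        target = roots if parent_id is None else nodes[parent_id]["children"]
        target.append(nodes[node_id])
    return roots
```

The linking step is a single pass over the rows and uses no recursion, so it does not depend on how deep the tree is.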

Related

Database design for: Very hierarchical data; off-server subset caching for processing; small to moderate size; (complete beginner)

I found myself with a project (very relaxed, with little to no consequences on failure) that I think requires a database of some sort. The problem is that, while I'm still quite inexperienced in general, I've never touched any database beyond the tutorials I could dig up with Google and setting up your average home cloud. I'm stuck on not knowing what I do not know.
Here is the situation:
Several hundred different automated test systems will frequently write small amounts of data into a database over a slow network. A few users will then infrequently fetch large subsets of that data from the database over a slow network. The data will then be processed, which requires a large number of reads; very high performance is desired at this point.
This will be the data (in order of magnitudes):
1000 products containing
10 variants containing
100 batches containing
100 objects containing
10 test-systems containing
100 test-steps containing
10 entries
It is basically a labeled B-tree with the test-steps as leaf nodes (since their format has been standardized).
A batch will always belong to one variant, an object will always belong to the same variant (but possibly to multiple batches), and a variant will always belong to one product. There are hundreds of thousands of different test-steps.
Possible queries will try to get (e.g.):
Everything from a batch (optional: and the value of an entry within a range)
Everything from a variant
All test-steps of the type X and Y from a test-system with the name Z
As far as I can tell, rows that are hundreds of thousands of columns wide (containing everything described above) do not seem like a good idea, and neither do about a trillion rows (and the middle ground between the two still seems quite extreme).
I'd really like to leverage the hierarchical nature of the data, but all I have found on something like nested databases is that they're simply not a thing.
It'd be nice if you could help me with:
What to search for
What'd be a good approach to structure and store this data
Some place where I can learn how to avoid the SQL horror stories, of which even I have found plenty
Whether there is a great way / best practice I should know of for transmitting the queried data and caching it locally for processing
Thank you and have a lovely day
Andreas
Search for "database normalization".
A normalized relational database is a fine structure.
If you want to avoid the horrors of SQL, you could also try a NoSQL document-oriented database, like MongoDB. I actually prefer this kind of database in a great many scenarios.
The database will cache your query results, and of course, whichever tool you use to query the database will cache the data in the tool's memory (or it will cache at least a subset of the query results if the number of results is very large). You can also write your results to a file. There are many ways to "cache", and they are all useful in different situations.
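To make the "normalized relational database" suggestion concrete, here is a small sketch using Python's built-in `sqlite3` module. All table and column names are hypothetical, and the hierarchy is shortened to product, variant, batch, and entry to keep the example brief. Each level is one table that points at its parent, so the row count grows with the data while the column count stays fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE variant (id INTEGER PRIMARY KEY,
                      product_id INTEGER NOT NULL REFERENCES product(id),
                      name TEXT NOT NULL);
CREATE TABLE batch   (id INTEGER PRIMARY KEY,
                      variant_id INTEGER NOT NULL REFERENCES variant(id),
                      name TEXT NOT NULL);
CREATE TABLE entry   (id INTEGER PRIMARY KEY,
                      batch_id INTEGER NOT NULL REFERENCES batch(id),
                      test_system TEXT NOT NULL,
                      step_type TEXT NOT NULL,
                      value REAL NOT NULL);
-- Supports "everything from a batch, with the value in a range".
CREATE INDEX entry_by_batch_value ON entry (batch_id, value);
""")
conn.execute("INSERT INTO product (id, name) VALUES (1, 'P1')")
conn.execute("INSERT INTO variant (id, product_id, name) VALUES (1, 1, 'V1')")
conn.execute("INSERT INTO batch (id, variant_id, name) VALUES (1, 1, 'B1')")
conn.executemany(
    "INSERT INTO entry (batch_id, test_system, step_type, value) VALUES (?, ?, ?, ?)",
    [(1, "Z", "X", 0.5), (1, "Z", "Y", 2.5), (1, "Q", "X", 9.0)],
)


def entries_for_batch(conn, batch_id, low, high):
    """Entries of one batch whose value lies in [low, high]."""
    return conn.execute(
        "SELECT test_system, step_type, value FROM entry "
        "WHERE batch_id = ? AND value BETWEEN ? AND ? ORDER BY value",
        (batch_id, low, high),
    ).fetchall()


def entries_for_variant(conn, variant_id):
    """Entries of every batch that belongs to one variant."""
    return conn.execute(
        "SELECT e.test_system, e.step_type, e.value FROM entry AS e "
        "JOIN batch AS b ON b.id = e.batch_id "
        "WHERE b.variant_id = ? ORDER BY e.id",
        (variant_id,),
    ).fetchall()
```

A query result fetched this way is an ordinary list of tuples, so it can be written to a local file and reused for the read-heavy processing step.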

Is MPTT an overkill for maintaining a tree like database even if the depth is around 3 - 4?

I am planning to store some tree like data in MySQL.
Topics can have sub topics and they in turn can have more sub topics.
Is Modified Preorder Tree Traversal(MPTT) an over kill even if the maximum depth is around 3 - 4?
Either way, you have to write model methods like get_children(), get_root(), is_root() and others. In some cases django-mptt will reduce the number of queries to the database. It is not overkill; it will save you a lot of time. The django-mptt code is more reliable than yours will be, so your code will have fewer potential bugs. Just spend a few hours reading the full docs =)
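For readers unfamiliar with what MPTT actually stores, here is a minimal sketch of the underlying idea in plain Python. This is not django-mptt's API; the function names and the dictionary-based tree are assumptions made for illustration. Each node gets a left and a right number from one preorder walk, and a node's descendants are exactly the nodes whose numbers fall strictly between its own.

```python
def assign_mptt(children, root):
    """Number nodes in preorder; returns {name: (left, right)}.

    children maps a node name to the ordered list of its child names.
    """
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        bounds[node] = (left, counter)
        counter += 1

    visit(root)
    return bounds


def descendants(bounds, node):
    """All nodes nested inside `node`, in tree order, without any recursion."""
    left, right = bounds[node]
    inside = [n for n, (l, r) in bounds.items() if left < l and r < right]
    return sorted(inside, key=lambda n: bounds[n][0])
```

In SQL the same lookup becomes a single range condition on the left and right columns, which is where the reduction in queries comes from; the cost is that inserts and moves have to renumber part of the tree.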

Travelling Salesman and Map/Reduce: Abandon Channel

This is an academic rather than a practical question. In the Traveling Salesman Problem, or any other problem that involves finding a minimum, if one were using a map/reduce approach it seems there would be some value in having the current minimum result broadcast to all of the computational nodes in a way that allows them to abandon computations that exceed it.
In other words, if we map the problem out, we'd like each node to know when to give up on a given partial result before it is complete, because it has already exceeded some other solution.
One approach that comes immediately to mind would be for the reducer to have a means of providing feedback to the mapper. Consider 100 nodes, with millions of paths being fed to them by the mapper. If the reducer feeds the best result back to the mapper, then that value could be included as an argument along with each new path (problem subset). In this approach the granularity is fairly coarse: the 100 nodes will each keep grinding away on their partition of the problem to completion and only get the new minimum with their next request from the mapper. (For a small number of nodes and a huge number of problem partitions/subsets to work across, this granularity would be inconsequential; also, one could likely apply heuristics to the order in which the possible routes or problem subsets are fed to the nodes to get rapid convergence towards the optimum and thus minimize the amount of "wasted" computation performed by the nodes.)
Another approach that comes to mind would be for the nodes to be actively subscribed to some sort of channel, or multicast or even broadcast from which they could glean new minimums from their computational loop. In that case they could immediately abandon a bad computation when notified of a better solution (by one of their peers).
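The second idea can be sketched on a single machine, with threads standing in for compute nodes and a shared object standing in for the broadcast channel. This is only an illustration under those assumptions: the names `SharedBound`, `explore`, and `solve` are hypothetical, no map/reduce framework is involved, and pruning on the partial cost assumes all distances are non-negative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedBound:
    """Stands in for the broadcast channel: the best complete tour cost so far."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = float("inf")

    def offer(self, cost):
        with self._lock:
            if cost < self.value:
                self.value = cost


def explore(dist, path, cost, bound):
    """Depth-first search that abandons a partial tour once it cannot win."""
    if cost >= bound.value:
        return  # a peer already found something at least this good
    n = len(dist)
    if len(path) == n:
        bound.offer(cost + dist[path[-1]][path[0]])  # close the tour
        return
    for city in range(n):
        if city not in path:
            explore(dist, path + [city], cost + dist[path[-1]][city], bound)


def solve(dist, workers=4):
    """Partition the search by the second city and run the partitions in parallel."""
    bound = SharedBound()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(explore, dist, [0, city], dist[0][city], bound)
            for city in range(1, len(dist))
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return bound.value
```

Every worker reads the current bound at each step of its own loop, so a better tour found by one partition immediately cuts work in the others. Across real machines the hard part is the latency of propagating that bound, which this sketch does not model.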
So, my questions are:
Is this concept covered by any terms of art in relation to existing map/reduce discussions?
Do any of the current map/reduce frameworks provide features to support this sort of dynamic feedback?
Is there some flaw with this idea ... some reason why it's stupid?
That's a cool topic, and one without much existing literature. So this is more of a brainstorming post than an answer to all your problems ;)
Every TSP instance can be expressed as a graph, which might look like this one (taken from the German Wikipedia):
Now you can run a graph algorithm on it. MapReduce can be used for graph processing quite well, although it has a lot of overhead.
You need a paradigm called "message passing". It was described in this paper here: Paper.
I also blogged about it in terms of graph exploration; the post explains quite simply how it works. My Blogpost
This is how you can tell the mapper what the current minimum result is (maybe just for the vertex itself).
With all that in the back of your mind, it should be fairly natural to think of a branch and bound algorithm (which is what you described) to get to the goal: pick a random start vertex and branch to every adjacent vertex. This causes a message to be sent to each of these adjacent vertices with the cost at which it can be reached from the start vertex (map step). The vertex itself only updates its cost if the new cost is lower than the currently stored cost (reduce step). Initially the stored cost should be set to infinity.
You do this over and over again until you've reached the start vertex again (obviously after you have visited every other one). So you have to somehow keep track of the currently best way to reach a vertex; this can be stored in the vertex itself, too. And every now and then you have to bound this branching and cut off branches that are too costly; this can be done in the reduce step after reading the messages.
Basically this is just a mix of graph algorithms in MapReduce and a kind of shortest paths.
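The map and reduce steps described above can be sketched as supersteps over an in-memory graph. This is a minimal illustration of the message-passing pattern applied to shortest-path costs only, not a TSP solver and not any framework's API; the function name `sssp_supersteps` and the dictionary layout of the graph are assumptions, and edge costs are assumed to be non-negative.

```python
def sssp_supersteps(graph, source):
    """Propagate the lowest known cost from `source` by message passing.

    graph maps each vertex to a dict of {neighbor: edge_cost}; every vertex
    must appear as a key, even if it has no outgoing edges.
    """
    cost = {vertex: float("inf") for vertex in graph}  # initially infinity
    inbox = {source: [0]}
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)            # reduce: combine incoming messages
            if best < cost[vertex]:         # only update when the cost improves
                cost[vertex] = best
                for neighbor, weight in graph[vertex].items():
                    # map: tell each neighbor the cost of reaching it this way
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox
    return cost
```

A vertex that receives no improving message sends nothing, which is the "bound" part: costly branches simply stop generating messages, and the loop ends when no messages are left.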
Note that this won't yield the optimal route between the nodes; it is still a heuristic. And you're just parallelizing an NP-hard problem.
BUT, with a little self-advertising again (maybe you've already read it in the blog post I linked), there exists an alternative to MapReduce that has far less overhead for this kind of graph processing. It is called BSP (Bulk Synchronous Parallel). It is freer in its communication and its computing model, so I'm sure this can be implemented much better with BSP than with MapReduce. You can realize the channels you've spoken about more easily with it.
I'm currently involved in a Summer of Code project which targets these SSSP problems with BSP. Maybe you want to visit if you're interested. This could then be part of a solution; it is described very well in my blog, too. SSSP's in my blog
I'm excited to hear some feedback ;)
It seems that Storm implements what I was thinking of. It's essentially a computational topology (think of how each compute node might be routing results based on a key/hashing function to the specific reducers).
This is not exactly what I described, but might be useful if one had a sufficiently low-latency way to propagate current bounding (i.e. local optimum information) which each node in the topology could update/receive in order to know which results to discard.

When to choose Cassandra over a SQL/Semantic Store solution?

I have 30-40 GB of data and 3 developer machines (Core Duo i4, 3GB). The data is a set of graph-like structures, and I have queries that traverse the graphs. Is there a guideline that could help me decide between Cassandra and a classic solution, e.g., SQL or a semantic store? My current plan is to set up Cassandra and see how it works, but I would like to learn more before starting the installation.
I would not use Cassandra for any kind of graph level structure. It has been about 6 months since I looked into doing something similar so maybe Cassandra has moved on since then but I found it was fundamentally limited by the fact that it only has row level indexes.
For a graph-based structure (assuming a simplistic one-arc-per-row layout) you really need column indexes as well. If you want to traverse the graph, you want to be able to start from a particular node A and find all the arcs that go from that node (assuming a directed graph). Without column indexes you'd have to do a row scan of the entire dataset, as there is no built-in functionality for saying "give me the rows that have A in a particular column".
To achieve this you effectively have to design a data layout for Cassandra that gives you an inverted index. This is somewhat tricky and requires you to know ahead of time the types of queries that you want to answer; answering new types of queries at a later date may be very difficult or impossible if you don't design well. These slides demonstrate the idea, and I hope they make it clear that you effectively have to construct your own indexes.
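The inverted-index idea can be illustrated without Cassandra at all. The sketch below uses plain Python dictionaries as a stand-in for a store that only supports lookup by row key; the class name `ArcStore` and its methods are hypothetical. The point is that the second mapping has to be written and maintained by the application on every insert.

```python
from collections import defaultdict


class ArcStore:
    """One arc per row, plus a hand-maintained index from source node to arcs."""

    def __init__(self):
        self.arcs = {}                     # arc_id -> (source, target): primary rows
        self.by_source = defaultdict(set)  # source -> arc_ids: the inverted index

    def add_arc(self, arc_id, source, target):
        self.arcs[arc_id] = (source, target)
        self.by_source[source].add(arc_id)  # must be kept in sync by hand

    def outgoing(self, source):
        """Targets reachable from `source` in one hop, without scanning all rows."""
        return sorted(self.arcs[arc_id][1] for arc_id in self.by_source.get(source, ()))
```

Answering a new kind of question, such as "which arcs point at node B?", would need another index of the same kind, built ahead of time, which is the design burden described above.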
For Graph structures that can be decomposed to triples consider an RDF store - for more complex structures then consider a full blown Graph Database. If you really want to do NoSQL you can probably build something on top of a document database as they tend to have much better indexing but again you'll have to think carefully about how you store your data.

Per frame optimization for large datasets

Summary
New to iPhone programming, I'm having trouble picking the right optimization strategy to filter a set of view components in a scrollview with huge content. In what area would my app gain the most performance?
Introduction
My current iPad app-in-progress lets users explore fairly large binary tree structures. The trees contain between 30 and 900 nodes, and when drawn inside a scrollview (with limited zoom) it looks like this.
The nodes' contents are stored in a SQLite backed Core Data model. It's a binary tree and if a node has children, there are always exactly two. The x and y positions are part of the model, as are the dimensions of the node connections, shown as dotted lines.
Optimization
Only about 50 nodes fit on the screen at any given time. With the largest trees containing up to 900 nodes, it's not possible to put everything in a scrolling, zooming UIView controlled by the scrollview; that's a recipe for crashes. So I have to do per-frame filtering of the nodes.
And that's where my troubles start. I don't have the experience to make a well-founded decision between the possible filtering options, and in addition I probably don't know about that really fast special magic buried deep in Objective-C or Cocoa Touch. Because the backing store is close to 200 MB in size (some 90,000 nodes in hundreds of trees), it's very time-consuming to test every single tree on the iPad device. Which is why I'd like to ask you guys for advice.
For all my attempts I'm putting a filter method in the scrollViewDidScroll: and scrollViewDidZoom:. I'm also blocking the main thread with the filter, because I can't show the content without the nodes anyway. But maybe someone has an idea in that area?
Because all the positioning is present in the Core Data model, I might use NSFetchRequest to do the filtering. Is that really fast though? I have the idea it's not a very optimized method.
From what I've tried, the faulted managed objects seem to fit in memory at once, but it might be tricky for the larger trees once their contents start firing faults. Is it a good idea to loop over the NSSet of nodes and see what items should be on screen?
Are there other tricks to gain performance? Would you see ways where I could use multi threading to get the display set faster, even though the model's context was created on the main thread?
Thanks for your advice,
EP.
Ironically, your binary tree could itself be divided using Binary Space Partitioning in 2D, so that finding the nodes to render is very fast and the number of checks per frame is close to the minimum necessary.
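As a rough sketch of that suggestion, the code below partitions node positions with alternating axis-aligned splits (a k-d tree, which is one simple form of binary space partitioning) and then collects only the nodes inside the visible rectangle. It is written in Python for brevity rather than Objective-C, and the names `build` and `visible` as well as the `(x, y, node_id)` tuple layout are assumptions.

```python
def build(points, depth=0):
    """Build a 2D partition tree from a list of (x, y, node_id) tuples."""
    if not points:
        return None
    axis = depth % 2  # split on x at even depths, on y at odd depths
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),       # coordinate <= split
        "right": build(points[mid + 1:], depth + 1),  # coordinate >= split
    }


def visible(tree, rect, found=None):
    """Collect ids of points inside rect = (min_x, min_y, max_x, max_y)."""
    if found is None:
        found = []
    if tree is None:
        return found
    x, y, node_id = tree["point"]
    min_x, min_y, max_x, max_y = rect
    if min_x <= x <= max_x and min_y <= y <= max_y:
        found.append(node_id)
    axis = tree["axis"]
    split = tree["point"][axis]
    if rect[axis] <= split:      # rectangle reaches into the lower half
        visible(tree["left"], rect, found)
    if rect[axis + 2] >= split:  # rectangle reaches into the upper half
        visible(tree["right"], rect, found)
    return found
```

The partition would be built once per tree, since the node positions are already stored in the model; each scroll or zoom callback then only walks the branches that overlap the viewport instead of looping over every node.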