How to deal with crashes in the middle of a B+ tree operation? - indexing

In between B+ tree operations, the tree is in a valid state, in the sense that the properties of a B+ tree are satisfied. However, during an operation on the B+ tree, those same properties will probably not be satisfied (at least not all of them). So if the process that operates on the B+ tree crashes or is killed, or the computer dies (maybe there is a power outage), when a tree operation is going on and the tree therefore is in an invalid state, how do you get the tree into a valid state again? That is, how do you recover from messing the tree up by only performing part of an operation?
Edit: I wrote this question with B+ trees in mind, but I'm really wondering how to do this for any tree data structure that uses the disk, so an answer of how to do this for B-trees would also be fine since B+ trees and B-trees are so similar.

Related

Genetic Algorithm: Need some clarification on selection and what to do when crossover doesn't happen

I'm writing a genetic algorithm to minimize a function. I have two questions, one in regards to selection and the other with regards to crossover and what to do when it doesn't happen.
Here's an outline of what I'm doing:
while (number of new population < current population)
    # Evaluate all fitnesses and give them a rank. Choose individual based on rank (roulette wheel) to get first parent.
    # Do it again to get second parent, ensuring parent1 =/= parent2
    # Elitism (do only once): choose the fittest individual and immediately copy to new generation
    Multi-point crossover: 50% chance
    if (crossover happened)
        do single point mutation on child (0.75%)
    else
        pick random individual to be copied into new population.
end
And all of this is under another while loop which tracks fitness progression and number of iterations, which I didn't include. So, my questions:
As you can see, two parents are chosen randomly in each iteration until the new population is filled up. So, the two same parents may mate more than once and surely several fit parents will mate many more times than once. Is this in any way bad?
In the obitko tutorial, it says if crossover doesn't happen, then child is exact copy of parents. I don't even understand what that means, so, as you can see, I just picked a random parent (uniformly; no fitness considered) and copied to new population. This seems weird to me. Whether I actually do this or not, my results really don't change that much. What's the proper way to handle the case when crossover doesn't happen?
Some parents having several offspring is common; I'd even say this is the default practice (and consider biological evolution, where precisely this is one of the main ingredients).
"If crossover doesn't happen, then child is exact copy of parents"
That is a bit confusing. Crossover (well explained in your link) means taking some genes from one parent and some from the other. This is called sexual reproduction and requires two (or more?) parents.
But asexual reproduction is also possible. In this case, you simply take one parent and mutate its genome in the new individual. This is almost what you were attempting, but you are missing the important mutation step (note mutations can be very aggressive or very conservative!)
Note that asexual reproduction requires mutation after copying the genome to create diversity, while in sexual reproduction this is an optional step.
It is fine to use either type of reproduction, or a mix of them. By the way: in some problems genes might not always have the same size. Sexual reproduction is problematic in this case. If you are interested in this problem, take a look at the NEAT algorithm, a popular neuroevolution algorithm designed to address this (wiki and paper).
Finally, elitism (copying the best-performing individuals to the next generation) is common, but it may be problematic. Genetic algorithms often stall in sub-optimal solutions (called local maxima, where any changes decrease fitness). Elitism can contribute to this problem. Of course, the opposite problem is too much diversity being similar to random search, so you need to find the right balance.
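To make the sexual/asexual distinction above concrete, here is a minimal sketch of a single reproduction step. It assumes genomes are equal-length lists of 0/1 genes and uses single-point crossover for brevity (the question uses multi-point crossover); the function name `make_child` and its parameters are made up for illustration, with the default rates taken from the question's outline.

```python
import random

def make_child(parent1, parent2, crossover_rate=0.5, mutation_rate=0.0075, rng=random):
    """Produce one child from two parents (equal-length lists of 0/1 genes)."""
    if rng.random() < crossover_rate:
        # Sexual reproduction: single-point crossover of the two genomes.
        point = rng.randrange(1, len(parent1))
        child = parent1[:point] + parent2[point:]
    else:
        # Asexual reproduction: start from a copy of one of the selected parents.
        child = list(rng.choice((parent1, parent2)))
    # Mutate in both cases: flip each gene independently with probability mutation_rate.
    return [1 - g if rng.random() < mutation_rate else g for g in child]
```

Because the parents were already chosen by fitness, the "no crossover" branch copies a selected parent instead of a uniformly random individual, and mutation still gets a chance to add diversity.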
I don't see anything wrong with the same individual being the parent of more than one child per generation. It can only affect your diversity a little bit. If you don't like this, or find a real lack of diversity in the final generations, you can flag the individual so it cannot be a parent more than once per generation.
I don't fully agree with the tutorial. I think that after you have selected the individuals that will become parents (based on their fitness, of course), you should always perform the crossover. Otherwise you will be cloning a lot of individuals into the next generation.

Hierarchy Representation for finding leaves

I have a number of hierarchies (trees) I am trying to represent in a relational database. The most important operation I will perform is: given a root node, find the leaves of the tree. Which hierarchy representation is best suited for this operation?

Efficient Database Structure for Deep Tree Data

For a very large database (more than a billion rows) that holds a very deep data tree, what is the most efficient structure? Reads make up most of the load, but there are also changes to the tree on a regular basis.
There are several standard algorithms to represent a data tree. I have found this reference as part of the Mongodb manual to be an excellent summary: http://docs.mongodb.org/manual/tutorial/model-tree-structures/
My system has properties that do not map well to any of these cases. The issue is that the depth of the tree is so great that keeping "ancestors" or a "path" is very large. The tree also changes frequently enough that the "Nested Sets" approach is not efficient. I am considering a hybrid of the "Materialized Paths" and "Parent References" approach, where instead of the path I store a hash that is not guaranteed to be unique, but 90% of the time it is. Then the 10% of the time there is a collision, the parent reference resolves it. The idea is that the 90% of the time there is a fast query for the path hash. This idea is kind of like a bloom filter technique. But this is all for background: the question is in the first line of this post.
What I've done in the past with arbitrarily deep trees is just to store a Parent Key with each node, as well as a sequence number which governs the order of children under a parent. I used RDBMSs and this worked very efficiently. Arranging the tree structure after reading required code to put each node in a Child collection in the node's parent, but this in fact ran pretty fast.
It's a pretty naive approach, in that there's nothing clever about it, but it does work for me.
The tree had about 300 or 400 members total and was I think 7 or 8 levels deep. This part of the system had no performance problems at all: it was very fast. The UI was a different matter, but that's another story.
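For what it's worth, the "arrange after reading" step described above might look roughly like this. It is only a sketch: the row layout `(node_id, parent_id, seq)` and the function names are assumptions, not the original schema. The traversal is iterative so that a very deep tree does not hit a recursion limit.

```python
from collections import defaultdict

def build_tree(rows):
    """rows: iterable of (node_id, parent_id, seq); parent_id is None for a root.

    Returns {parent_id: [child ids ordered by seq]}.
    """
    children = defaultdict(list)
    for node_id, parent_id, seq in rows:
        children[parent_id].append((seq, node_id))
    # Order each child collection by its sequence number.
    return {parent: [n for _, n in sorted(kids)] for parent, kids in children.items()}

def leaves(tree, root):
    """Collect the leaves below root, left to right, without recursion."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        kids = tree.get(node, [])
        if kids:
            stack.extend(reversed(kids))  # reversed so the first child is popped first
        else:
            found.append(node)
    return found
```

Building the child collections is a single pass over the rows, which fits the observation that this part ran fast.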

Travelling Salesman and Map/Reduce: Abandon Channel

This is an academic rather than practical question. In the Traveling Salesman Problem, or any other which involves finding a minimum optimization ... if one were using a map/reduce approach it seems like there would be some value to having some means for the current minimum result to be broadcast to all of the computational nodes in some manner that allows them to abandon computations which exceed that.
In other words if we map the problem out we'd like each node to know when to give up on a given partial result before it's complete but when it's already exceeded some other solution.
One approach that comes immediately to mind would be if the reducer had a means to provide feedback to the mapper. Consider if we had 100 nodes, and millions of paths being fed to them by the mapper. If the reducer feeds the best result to the mapper, then that value could be included as an argument along with each new path (problem subset). In this approach the granularity is fairly rough: the 100 nodes will each keep grinding away on their partition of the problem to completion and only get the new minimum with their next request from the mapper. (For a small number of nodes and a huge number of problem partitions/subsets to work across, this granularity would be inconsequential; also it's likely that one could apply heuristics to the sequence in which the possible routes or problem subsets are fed to the nodes to get a rapid convergence towards the optimum and thus minimize the amount of "wasted" computation performed by the nodes.)
Another approach that comes to mind would be for the nodes to be actively subscribed to some sort of channel, or multicast or even broadcast from which they could glean new minimums from their computational loop. In that case they could immediately abandon a bad computation when notified of a better solution (by one of their peers).
So, my questions are:
Is this concept covered by any terms of art in relation to existing map/reduce discussions?
Do any of the current map/reduce frameworks provide features to support this sort of dynamic feedback?
Is there some flaw with this idea ... some reason why it's stupid?
That's a cool topic, and not much literature has been written on it before. So this is pretty much a brainstorming post, rather than an answer to all your problems ;)
Every TSP can be expressed as a graph, which might look like this one: (taken from the German Wikipedia)
Now you can run a graph algorithm on it. MapReduce can be used for graph processing quite well, although it has much overhead.
You need a paradigm that is called "Message Passing". It was described in this paper here: Paper.
And I blogged about it in terms of graph exploration; the post explains quite simply how it works. My Blogpost
This is how you can tell the mapper what the current minimum result is (maybe just for the vertex itself).
With all that knowledge in the back of your mind, it should be pretty standard to think of a branch and bound algorithm (which you described) to get to the goal. For example, take a random start vertex and branch to every adjacent vertex. This causes a message to be sent to each of these adjacent vertices with the cost at which it can be reached from the start vertex (Map Step). The vertex itself only updates its cost if the new cost is lower than the currently stored cost (Reduce Step). Initially the stored cost should be set to infinity.
You're doing this over and over again until you've reached the start vertex again (obviously after you visited every other one). So you have to somehow keep track of the currently best way to reach a vertex, this can be stored in the vertex itself, too. And every now and then you have to bound this branching and cut off branches that are too costly, this can be done in the reduce step after reading the messages.
Basically this is just a mix of graph algorithms in MapReduce and a kind of shortest paths.
Note that this won't yield the optimal route between the nodes; it is still a heuristic. And you're just parallelizing the NP-hard problem.
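As a rough single-process illustration of the message-passing rounds described above, the following sketch sends each active vertex's cost to its neighbours ("map") and lets each vertex keep the smallest incoming cost if it beats the stored one ("reduce"). The graph layout and the name `propagate_costs` are assumptions made for the example, and edge costs are assumed to be non-negative. It computes cheapest reach costs from the start vertex, not a TSP tour.

```python
import math

def propagate_costs(graph, start):
    """graph: {vertex: {neighbor: edge_cost}}. Returns the cheapest known cost per vertex."""
    cost = {v: math.inf for v in graph}  # every vertex starts at infinity
    cost[start] = 0
    active = {start}
    while active:
        # "Map" step: each active vertex messages its neighbours with cost + edge weight.
        inbox = {}
        for v in active:
            for neighbor, weight in graph[v].items():
                inbox.setdefault(neighbor, []).append(cost[v] + weight)
        # "Reduce" step: a vertex updates only if the best message beats its stored cost.
        active = set()
        for v, messages in inbox.items():
            best = min(messages)
            if best < cost[v]:
                cost[v] = best
                active.add(v)
    return cost
```

Each pass of the `while` loop corresponds to one MapReduce job (or one superstep), which is where the overhead mentioned above comes from.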
BUT, a little self-advertising again (maybe you've read it already in the blog post I've linked): there exists an abstraction to MapReduce that has way less overhead in this kind of graph processing. It is called BSP (Bulk Synchronous Parallel). It is freer in its communication and its computing model, so I'm sure that this can be implemented a lot better with BSP than with MapReduce. You can realize the channels you've spoken about better with it.
I'm currently involved in a Summer of Code project which targets these SSSP problems with BSP. Maybe you want to visit if you're interested. This could then be a partial solution; it is described very well in my blog, too. SSSP's in my blog
I'm excited to hear some feedback ;)
It seems that Storm implements what I was thinking of. It's essentially a computational topology (think of how each compute node might be routing results based on a key/hashing function to the specific reducers).
This is not exactly what I described, but might be useful if one had a sufficiently low-latency way to propagate current bounding (i.e. local optimum information) which each node in the topology could update/receive in order to know which results to discard.
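The "abandon once you exceed the current minimum" idea from the question can be sketched in a single process as plain branch and bound. Here the dictionary `best` stands in for whatever channel or broadcast would carry the current minimum between nodes; that stand-in, the function name, and the distance-matrix input are all assumptions for illustration, and costs are assumed to be non-negative.

```python
import math

def tsp_branch_and_bound(dist):
    """dist: square matrix of non-negative costs. Returns (best_cost, best_tour)."""
    n = len(dist)
    best = {"cost": math.inf, "tour": None}  # stand-in for the shared "current minimum"

    def extend(tour, cost, visited):
        if cost >= best["cost"]:
            return  # abandon: this partial tour already costs as much as the best known
        if len(tour) == n:
            total = cost + dist[tour[-1]][tour[0]]  # close the cycle
            if total < best["cost"]:
                best["cost"], best["tour"] = total, tour[:]
            return
        for city in range(n):
            if city not in visited:
                visited.add(city)
                tour.append(city)
                extend(tour, cost + dist[tour[-2]][city], visited)
                tour.pop()
                visited.remove(city)

    extend([0], 0, {0})
    return best["cost"], best["tour"]
```

In a distributed setting a worker might see a stale value of the bound. That only costs some wasted work, since any previously found tour cost is still a valid upper bound.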

Object Oriented implementation of graph data structures

I have been reading quite a bit about graph data structures lately, as I intend to write my own UML tool. As far as I can see, what I want can be modeled as a simple graph consisting of vertices and edges. Vertices will have a few values, and so will best be represented as objects. Edges do not, as far as I can see, need to be either directed or weighted, but I do not want to choose an implementation that makes it impossible to include such properties later on.
Being educated in pure object oriented programming, the first thing that comes to my mind is representing vertices and edges by classes, like for example:
Class: Vertice
- Array arrayOfEdges;
- String name;
Class: Edge
- Vertice from;
- Vertice to;
This gives me the possibility to later introduce weights, direction, and so on. Now, when I read up on implementing graphs, it seems that this is a very uncommon solution. Earlier questions here on Stack Overflow suggest adjacency lists and adjacency matrices, but being completely new to graphs, I have a hard time understanding why that is better than my approach.
The most important aspects of my application are the ability to easily calculate which vertex is clicked and moved, and the ability to add and remove vertices and edges between the vertices. Will this be easier to accomplish in one implementation over another?
My language of choice is Objective-C, but I do not believe that this should be of any significance.
Here are the two basic graph types along with their typical implementations:
Dense Graphs:
- Adjacency Matrix
- Incidence Matrix
Sparse Graphs:
- Adjacency List
- Incidence List
In the graph framework (closed source, unfortunately) that I've been writing (>12k loc graph implementations + >5k loc unit tests and still counting) I've been able to implement (Directed/Undirected/Mixed) Hypergraphs, (Directed/Undirected/Mixed) Multigraphs, (Directed/Undirected/Mixed) Ordered Graphs, (Directed/Undirected/Mixed) KPartite Graphs, as well as all kinds of Trees, such as Generic Trees, (A,B)-Trees, KAry-Trees, Full-KAry-Trees, (Trees to come: VP-Trees, KD-Trees, BKTrees, B-Trees, R-Trees, Octrees, …).
And all without a single vertex or edge class. Purely generics. And with little to no redundant implementations.
Oh, and as if this wasn't enough they all exist as mutable, immutable, observable (NSNotification), thread-unsafe and thread-safe versions.
How? Through excessive use of Decorators.
Basically all graphs are mutable, thread-unsafe and not observable. So I use Decorators to add all kinds of flavors to them (resulting in no more than 35 classes, vs. 500+ if implemented without decorators, right now).
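To give a rough idea of the decorator approach (this is not the framework's actual code, which is closed source and written in Objective-C), here is a small Python sketch. All class names are invented for the example: a plain graph is wrapped once to make it observable and once more to make its mutations thread-safe.

```python
import threading

class SimpleGraph:
    """Plain base graph: mutable, thread-unsafe, not observable."""
    def __init__(self):
        self.adjacent = {}

    def add_edge(self, u, v):
        self.adjacent.setdefault(u, set()).add(v)
        self.adjacent.setdefault(v, set()).add(u)

class ObservableGraph:
    """Decorator: notifies a listener callback after each mutation."""
    def __init__(self, graph, listener):
        self._graph = graph
        self._listener = listener

    def add_edge(self, u, v):
        self._graph.add_edge(u, v)
        self._listener("edge_added", (u, v))

    def __getattr__(self, name):
        # Anything not overridden is forwarded to the wrapped graph.
        return getattr(self._graph, name)

class ThreadSafeGraph:
    """Decorator: serializes mutations with a lock (forwarded reads are not locked)."""
    def __init__(self, graph):
        self._graph = graph
        self._lock = threading.Lock()

    def add_edge(self, u, v):
        with self._lock:
            self._graph.add_edge(u, v)

    def __getattr__(self, name):
        return getattr(self._graph, name)
```

Usage would be something like `ThreadSafeGraph(ObservableGraph(SimpleGraph(), callback))`. Each flavor is written once and can be stacked on any graph type, which is why the class count stays small.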
While I cannot give any actual code, my graphs are basically implemented via Incidence Lists by use of mainly NSMutableDictionaries and NSMutableSets (and NSMutableArrays for my ordered Trees).
My Undirected Sparse Graph has nothing but these ivars, e.g.:
NSMutableDictionary *vertices;
NSMutableDictionary *edges;
The ivar vertices maps vertices to adjacency maps of vertices to incident edges ({"vertex": {"vertex": "edge"}})
And the ivar edges maps edges to incident vertex pairs ({"edge": {"vertex", "vertex"}}), with Pair being a pair data object holding an edge's head vertex and tail vertex.
Mixed Sparse Graphs would have a slightly different mapping of adjacency/incidence lists and so would Directed Sparse Graphs, but you should get the idea.
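As a hedged illustration of the two mappings just described (a Python stand-in, not the actual Objective-C implementation), an undirected sparse graph built from nothing but a `vertices` dictionary and an `edges` dictionary could look like this. The method names are assumptions, and a plain tuple plays the role of the Pair object.

```python
class UndirectedSparseGraph:
    def __init__(self):
        self.vertices = {}  # vertex -> {adjacent vertex -> incident edge}
        self.edges = {}     # edge -> (vertex, vertex)

    def add_vertex(self, v):
        self.vertices.setdefault(v, {})

    def add_edge(self, e, u, v):
        if e in self.edges:
            raise ValueError("edge objects must be unique")
        if v in self.vertices.get(u, {}):
            raise ValueError("parallel edges are not supported in this sketch")
        self.add_vertex(u)
        self.add_vertex(v)
        self.vertices[u][v] = e
        self.vertices[v][u] = e
        self.edges[e] = (u, v)

    def remove_edge(self, e):
        u, v = self.edges.pop(e)
        del self.vertices[u][v]
        if u != v:  # a self-loop has only one adjacency entry
            del self.vertices[v][u]

    def remove_vertex(self, v):
        # Remove every incident edge first, then the vertex itself.
        for e in list(self.vertices[v].values()):
            self.remove_edge(e)
        del self.vertices[v]

    def incident_edges(self, v):
        return set(self.vertices[v].values())
```

Vertices and edges are arbitrary hashable objects here, which mirrors the point that no dedicated vertex or edge class is needed.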
A limitation of this implementation is that every vertex and every edge needs to have an object associated with it. And to make things a bit more interesting (sic!), each vertex object needs to be unique, and so does each edge object, because dictionaries don't allow duplicate keys. Also, objects need to implement NSCopying. NSValueTransformers or value encapsulation are a way to sidestep these limitations, though (the same goes for the memory overhead from dictionary key copying).
While the implementation has its downsides, there's a big benefit: immense versatility!
There's hardly any type of graph I could think of that's impossible to achieve with what I already have. Instead of building each type of graph with custom-built parts, you basically go to your box of lego bricks and assemble the graphs just the way you need them.
Some more insight:
Every major graph type has its own Protocol, here are a few:
HypergraphProtocol
MultigraphProtocol [tagging protocol] (allows parallel edges)
GraphProtocol (allows directed & undirected edges)
UndirectedGraphProtocol [tagging protocol] (allows only undirected edges)
DirectedGraphProtocol [tagging protocol] (allows only directed edges)
ForestProtocol (allows sets of disjunct trees)
TreeProtocol (allows trees)
ABTreeProtocol (allows trees of a-b children per vertex)
FullKAryTreeProtocol [tagging protocol] (allows trees of either 0 or k children per vertex)
The protocol nesting implies inheritance (of both protocols and implementations).
If there's anything else you'd like more insight into, feel free to leave a comment.
Ps: To give credit where credit is due: Architecture was highly influenced by the JUNG Java graph framework (55k+ loc).
Pps: Before choosing this type of implementation I had written a smaller sibling of it with just undirected graphs, which I wanted to expand to also support directed graphs. My implementation was pretty similar to the one you are providing in your question. This is what brought my first (rather naïve) project to an abrupt end back then: "Subclassing a set of inter-dependent classes in Objective-C and ensuring type-safety". Adding simple directedness to my graph caused my entire code to break apart. (I didn't even use the solution that I posted back then, as it would have just postponed the pain.) Now with the generic implementation I have more than 20 graph flavors implemented, with no hacks at all. It's worth it.
If all you want is drawing a graph and being able to move its nodes on the screen, though, you'd be fine with just implementing a generic graph class that can then later on be extended to specific directedness, if needed.
An adjacency matrix will have a bit more difficulty than your object model in adding and removing vertices (but not edges), since this involves adding and removing rows and columns from a matrix. There are tricks you could use to do this, like keeping empty rows and columns, but it will still be a bit complicated.
When moving a vertex around the screen, the edges will also be moved. This also gives your object model a slight advantage, since it will have a list of connected edges and will not have to search through the matrix.
Both models have an inherent directedness to the edges, so if you want to have undirected edges, then you will have to do additional work either way.
I would say that overall there is not a whole lot of difference. If I were implementing this, I would probably do something similar to what you are doing.
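To show why vertex insertion and removal is the awkward part with a matrix, here is a small sketch of an undirected adjacency matrix. The class and method names are assumptions for illustration only. Adding or removing a vertex touches every row, and finding a vertex's neighbours means scanning a whole row.

```python
class MatrixGraph:
    def __init__(self):
        self.names = []   # row/column index -> vertex name
        self.matrix = []  # matrix[i][j] is True if vertices i and j are connected

    def add_vertex(self, name):
        for row in self.matrix:
            row.append(False)  # new column in every existing row
        self.names.append(name)
        self.matrix.append([False] * len(self.names))  # new row

    def remove_vertex(self, name):
        i = self.names.index(name)
        del self.names[i]
        del self.matrix[i]     # drop the row
        for row in self.matrix:
            del row[i]         # drop the column from every remaining row

    def add_edge(self, a, b):
        i, j = self.names.index(a), self.names.index(b)
        self.matrix[i][j] = self.matrix[j][i] = True  # undirected: set both cells

    def neighbors(self, name):
        i = self.names.index(name)
        return [self.names[j] for j, connected in enumerate(self.matrix[i]) if connected]
```

Setting both cells in `add_edge` is the "additional work" needed to get undirected edges out of an inherently directed representation.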
If you're using Objective-C, I assume you have access to Core Data, which would probably be a great place to start. I understand you're creating your own graph; the strength of Core Data is that it can do a lot of the checking you're talking about for free if you set up your schema properly.