Titan vertex centric indices vs Neo4j labels - indexing

I was trying to make a comparison between these two technologies when approaching this and I was wondering if any of you already have some experience dealing with any or both of them?
I am mainly interested in performance numbers when dealing with similar use cases.

The difference between the two concepts is the difference between global and local indexing.
As I understand it, Neo4j vertex labels allow you to break up your index space by "categories" of vertices. In this way, a O(log(|V|)) lookup is now an O(log(|V|/c)), where c is the number of categories/labels you have over your vertex set and (the equation) assumes an equal number of vertices in each category. As such, vertex label aid in global index calls as this is a function of V.
Next, Titan's vertex-centric indices sort and index the incident edges of a vertex. The cost to find a particular edge by its label/properties incident to a vertex is O(log(inc(v))), where inc(v) is the size of the incident edge set to vertex v. As such, vertex-centric indices are local indices as this is a function of v.
As I understand it, Neo4j does not support vertex-centric indices. You see this concept currently in Titan, OrientDB, and TinkerGraph (…and RDF stores sort in this manner as well -- via spog pairings). Next, all known graph databases support global indices though, (I believe only Neo4j and OrientDB), support a vertex set partition via the concept of a label.
Again, assuming my assumptions are correct about the use of vertex labels in Neo4j, we are talking about two different use cases — global vs. local indexing. From the perspective of the supernode problem, global indices do not quell the issue of traversing through a large vertex, while this is the sole purpose of the local vertex-centric indices.
You can read about the supernode problem and vertex-centric indices here:
http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/

Agreeing with everything Marko said, one could take it further and argue that in the graph database world local indexes can (and even should) substitute global ones. In my opinion, the single greatest advantage of a graph data model is that it lets you encode your data model into the graph topology, gaining qualitative advantages in terms of flexibility, ease of evolution and performance. With this in mind, I'd argue that labels in Neo4j actually detract from all this; reifying a label into a node with adjacent edges pointing to the source having that label is much more in line with the "schema is the graph" philosophy.
Of course, if your engine lacks local indexes we are back at the supernode problem. But if you do have them (something which I'd say should be a requirement for something to be called a graph database), you can easily transform your label into a node L, and create relationships pointing to that node for those vertices which you want labeled with L
v -[L]-> L
meaning that v has label L. Now if you want this in Titan to behave like a Neo4j label, just make the -[L]-> relation to be "manyToOne" (see Titan cardinality constraints) and create a vertex-centric index. This pattern lets you get everything that you could with labels and much more; you can
effectively use this as a namespace for properties relating to that label
sort your elements inside one label
nest labels easily without losing performance (just use a composite key)
separate the declaration of a label L with how elements labeled with it are accessed

Labels may afford some design patterns that improve performance by de-densifying the graph. For example: they eliminate the need for type nodes, which can often get quite dense. Labels can optionally be associated with a unique index. Here, the ability to index a property isn't new, but the ability to constrain it uniquely is. If you were previously doing work in your application, you may experience some performance gains by letting the database handle this. (It's certainly much more convenient to do so.) Finally, if you don't assign a unique index to a label, it will still be indexed, in order to help performance for certain kinds of queries (e.g. "give me all of the nodes having label ")
All that said, while labels may help with performance in certain cases, they were introduced more with ease-of-use in mind. We're just getting started with Neo4j 2.1, which specifically addresses dense node performance (something I know you've been waiting for), along with other performance & scalability improvements... including removing (for all practical purposes eliminating) the upper size limits.
Philip

Related

Using PCA on Part of Dataframe

I want to use a clustering algorithm to a dataframe that contains a lot of features (32 columns).
A part of the features are encoded using one hot encoder.
I want to use PCA ( Principal Component analysis ) to reduce the dimension and make the machine learning process easier.
Is it possible to use the PCA just for some columns of the data frame and keep the other columns as they are then use machine learning model.
Or it is obligatory to use PCA for all the dataframe before clustering.
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones, but then using other, non-merged ones in addition to the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.

OpenGL texture mapping without duplicate vertices

I've been using OpenGL for some time now, and I am at a point where I'm doing optimizing of the various aspects of my application.
Here's the problem: In some cases I end up duplicating vertixes for texture mapping. I've been using this book a lot:
http://staff.fit.ac.cy/eng.ap/FALL2017/3d-game-development-with-lwjgl.pdf
You can go to pages 70-71 and see the issue and the suggested solution. Basically, it says that when a vertex needs to have a different texture coord for different faces - you've got to duplicate that vertex.
And that that's the only solution.
But I hesitate - is it? Lots of UV maps will have cases where the coords are cut apart a bit, for vertexes. I.e. vertexes end up having different coords. Duplicating them seems like a huge waste.
Is there a way to do it a different way?
A vertex is not a just a position. A vertex is the whole vector of the position, color, texture coordinate and whatever else attribute you're putting into it.
But I hesitate - is it?
Yes it is. It is also the most efficient way to do this, as is allows for easier caching.
Is there a way to do it a different way?
You could use different arrays for each attribute, then have a primary index array to address into secondary index arrays, one for each attribute, that then do the lookup for each attribute.
Two indirections and very little cache coherence. Overall the overhead of secondary index arrays will consume more memory, than the little overhead of a "few" duplicate position vertices. And you're avoiding double indirection.
Overall, the way some duplicated vertex positions is also the most performant one.

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but cannot see how I can associate with a document my own custom feature vector with precomputed weights. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically a zset is a sorted set of key-value pairs.
So you can have a sorted set per feature where each
feature->[ { docid, score }, {docid, score} ..]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allow you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index, and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look at would be
org.apache.lucene.search.Similarity
you should be able to adapt it to your needs and set your version as default with
setDefault(Similarity similarity)
I would be careful with expectations for speed gains (w.r.t. iterating through all) however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme, first boolean ("all of the AND terms contained? any of the OR terms?") then scoring what passes. While for tf.idf you lose nothing on the way for other scoring functions you might.
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

Object Oriented implementation of graph data structures

I have been reading quite a bit graph data structures lately, as I have intentions of writing my own UML tool. As far as I can see, what I want can be modeled as a simple graph consisting of vertices and edges. Vertices will have a few values, and will so best be represented as objects. Edges does not, as far as I can see, need to be neither directed or weighted, but I do not want to choose an implementation that makes it impossible to include such properties later on.
Being educated in pure object oriented programming, the first things that comes to my mind is representing vertices and edges by classes, like for example:
Class: Vertice
- Array arrayOfEdges;
- String name;
Class: Edge
- Vertice from;
- Vertice to;
This gives me the possibility to later introduce weights, direction, and so on. Now, when I read up on implementing graphs, it seems that this is a very uncommon solution. Earlier questions here on Stack Overflow suggests adjacency lists and adjacency matrices, but being completely new to graphs, I have a hard time understanding why that is better than my approach.
The most important aspects of my application is having the ability to easily calculate which vertice is clicked and moved, and the ability to add and remove vertices and edges between the vertices. Will this be easier to accomplish in one implementation over another?
My language of choice is Objective-C, but I do not believe that this should be of any significance.
Here are the two basic graph types along with their typical implementations:
Dense Graphs:
Adjacency Matrix
Incidence Matrix
Sparse Graphs:
Adjacency List
Incidence List
In the graph framework (closed source, unfortunately) that I've ben writing (>12k loc graph implementations + >5k loc unit tests and still counting) I've been able to implement (Directed/Undirected/Mixed) Hypergraphs, (Directed/Undirected/Mixed) Multigraphs, (Directed/Undirected/Mixed) Ordered Graphs, (Directed/Undirected/Mixed) KPartite Graphs, as well as all kinds of Trees, such as Generic Trees, (A,B)-Trees, KAry-Trees, Full-KAry-Trees, (Trees to come: VP-Trees, KD-Trees, BKTrees, B-Trees, R-Trees, Octrees, …).
And all without a single vertex or edge class. Purely generics. And with little to no redundant implementations**
Oh, and as if this wasn't enough they all exist as mutable, immutable, observable (NSNotification), thread-unsafe and thread-safe versions.
How? Through excessive use of Decorators.
Basically all graphs are mutable, thread-unsafe and not observable. So I use Decorators to add all kinds of flavors to them (resulting in no more than 35 classes, vs. 500+ if implemented without decorators, right now).
While I cannot give any actual code, my graphs are basically implemented via Incidence Lists by use of mainly NSMutableDictionaries and NSMutableSets (and NSMutableArrays for my ordered Trees).
My Undirected Sparse Graph has nothing but these ivars, e.g.:
NSMutableDictionary *vertices;
NSMutableDictionary *edges;
The ivar vertices maps vertices to adjacency maps of vertices to incident edges ({"vertex": {"vertex": "edge"}})
And the ivar edges maps edges to incident vertex pairs ({"edge": {"vertex", "vertex"}}), with Pair being a pair data object holding an edge's head vertex and tail vertex.
Mixed Sparse Graphs would have a slightly different mapping of adjascency/incidence lists and so would Directed Sparse Graphs, but you should get the idea.
A limitation of this implementation is, that both, every vertex and every edge needs to have an object associated with it. And to make things a bit more interesting(sic!) each vertex object needs to be unique, and so does each edge object. This is as dictionaries don't allow duplicate keys. Also, objects need to implement NSCopying. NSValueTransformers or value-encapsulation are a way to sidestep these limitation though (same goes for the memory overhead from dictionary key copying).
While the implementation has its downsides, there's a big benefit: immensive versatility!
There's hardly any type graph that I could think of that's impossible to archieve with what I already have. Instead of building each type of graph with custom built parts you basically go to your box of lego bricks and assemble the graphs just the way you need them.
Some more insight:
Every major graph type has its own Protocol, here are a few:
HypergraphProtocol
MultigraphProtocol [tagging protocol] (allows parallel edges)
GraphProtocol (allows directed & undirected edges)
UndirectedGraphProtocol [tagging protocol] (allows only undirected edges)
DirectedGraphProtocol [tagging protocol] (allows only directed edges)
ForestProtocol (allows sets of disjunct trees)
TreeProtocol (allows trees)
ABTreeProtocol (allows trees of a-b children per vertex)
FullKAryTreeProtocol [tagging protocol] (allows trees of either 0 or k children per vertex)
The protocol nesting implies inharitance (of both protocols, as well as implementations).
If there's anything else you'd like to get some mor insight, feel free to leave a comment.
Ps: To give credit where credit is due: Architecture was highly influenced by the
JUNG Java graph framework (55k+ loc).
Pps: Before choosing this type of implementation I had written a small brother of it with just undirected graphs, that I wanted to expand to also support directed graphs. My implementation was pretty similar to the one you are providing in your question. This is what gave my first (rather naïve) project an abrupt end, back then: Subclassing a set of inter-dependent classes in Objective-C and ensuring type-safety Adding a simple directedness to my graph cause my entire code to break apart. (I didn't even use the solution that I posted back then, as it would have just postponed the pain) Now with the generic implementation I have more than 20 graph flavors implemented, with no hacks at all. It's worth it.
If all you want is drawing a graph and being able to move its nodes on the screen, though, you'd be fine with just implementing a generic graph class that can then later on be extended to specific directedness, if needed.
An adjacency matrix will have a bit more difficulty than your object model in adding and removing vertices (but not edges), since this involves adding and removing rows and columns from a matrix. There are tricks you could use to do this, like keeping empty rows and columns, but it will still be a bit complicated.
When moving a vertex around the screen, the edges will also be moved. This also gives your object model a slight advantage, since it will have a list of connected edges and will not have to search through the matrix.
Both models have an inherent directedness to the edges, so if you want to have undirected edges, then you will have to do additional work either way.
I would say that overall there is not a whole lot of difference. If I were implementing this, I would probably do something similar to what you are doing.
If you're using Objective-C I assume you have access to Core Data which would be probably be a great place to start - I understand you're creating your own graph, the strength of Core Data being that it can do a lot of the checking you're talking about for free if you set up your schema properly

When to choose Cassandra over a SQL/Semantic Store solution?

I have 30-40 GB of data and 3 developer machines (Core Duo i4, 3GB). The data is a set of graph like structures and I have queries that traverse the graphs. Is there a guideline that could help me to decide to use Cassandra or a classic solution, e.g., SQL or Semantic Store? My current plan is to set up Cassandra and see how does it work but I would like to learn more before starting the installation.
I would not use Cassandra for any kind of graph level structure. It has been about 6 months since I looked into doing something similar so maybe Cassandra has moved on since then but I found it was fundamentally limited by the fact that it only has row level indexes.
For a Graph based structure (assuming a simplistic one arc per row layout) you really need column indexes as well since if you want to traverse the graph you want to be able to start from a particular node A and find all the arcs that go from that node (assuming a directed Graph) then you'd have to do a row scan of the entire dataset as there is no built in functionality for saying give me the rows that have A in a particular column.
To achieve this you have to effectively design a data layout for Cassandra that gives you an inverted index. This is somewhat tricky and requires you to know ahead of time the type of queries that you want to answer - answering new types of queries at a later data may be very difficult or impossible if you don't design well. These slides demonstrate the idea but I hope it makes it clear that you effectively have to construct your own indexes.
For Graph structures that can be decomposed to triples consider an RDF store - for more complex structures then consider a full blown Graph Database. If you really want to do NoSQL you can probably build something on top of a document database as they tend to have much better indexing but again you'll have to think carefully about how you store your data.