Fast in-memory inverted index - indexing

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I cannot see how to associate my own custom feature vector of precomputed weights with a document. Any recommendations would be much appreciated!
Thank you.

I have been doing some similar work and have discovered that Redis's zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory-mapped files).
Basically, a zset is a sorted set of member-score pairs. So you can have one sorted set per feature, mapping each feature to its posting list:
feature -> [ { docid, score }, { docid, score }, ... ]
i.e.
zadd feature score docid
Redis then has some nice operators to merge sets, extract ranges, etc. See ZUNIONSTORE and ZRANGE (http://redis.io/commands/zunionstore).
It is very fast (supposedly) and all in memory, etc. (though Redis is a standalone server, not an embedded DB).
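If you go down the Redis route, here is a rough sketch of the pattern using the Jedis client (the key names and weights are made up for illustration; the weighted ZUNIONSTORE approximates a dot product between a query's feature weights and each document's stored weights):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ZParams;

public class FeatureIndexSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost");

        // one zset per feature: member = docid, score = precomputed weight
        jedis.zadd("feature:color=red", 0.8, "doc:1");
        jedis.zadd("feature:color=red", 0.3, "doc:2");
        jedis.zadd("feature:size=large", 0.5, "doc:1");

        // weighted union over the query's features accumulates, per docid,
        // sum(queryWeight * docWeight) -- i.e. an unnormalized dot product
        ZParams params = new ZParams().weights(1.0, 0.5);
        jedis.zunionstore("tmp:scores", params, "feature:color=red", "feature:size=large");

        // best-matching documents first
        System.out.println(jedis.zrevrangeWithScores("tmp:scores", 0, 9));
        jedis.close();
    }
}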

Have you looked at Terrier? I'm not quite sure whether it has in-memory indexes, but it is far more extensible than Lucene with regard to indexing and scoring.

Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index, associated with a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
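For illustration, a rough sketch of that approach with a custom TokenStream that emits each feature as a term and encodes its precomputed weight as a payload (API names are from Lucene 4.x-era Java and may differ in your version; single-use stream for brevity):

import java.nio.ByteBuffer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// One token per feature, carrying the feature's weight as a 4-byte payload.
final class FeatureTokenStream extends TokenStream {
    private final String[] features;
    private final float[] weights;
    private int i = -1;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    FeatureTokenStream(String[] features, float[] weights) {
        this.features = features;
        this.weights = weights;
    }

    @Override
    public boolean incrementToken() {
        if (++i >= features.length) return false;
        clearAttributes();
        termAtt.append(features[i]);
        // encode the precomputed weight as a float payload
        payloadAtt.setPayload(new BytesRef(ByteBuffer.allocate(4).putFloat(weights[i]).array()));
        return true;
    }
}

At query time you would enumerate the postings with payloads enabled, decode the floats, and feed them into your own similarity.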

If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just look up the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
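For the pairwise case, a minimal sketch of the core operation, a sparse dot product over two vectors kept as parallel arrays sorted by feature id (plain Java, no library assumed; divide by the vector norms if you want cosine similarity):

// Merge-style walk: only features present in both vectors contribute.
static double dot(int[] idsA, float[] wA, int[] idsB, float[] wB) {
    double sum = 0;
    int i = 0, j = 0;
    while (i < idsA.length && j < idsB.length) {
        if (idsA[i] == idsB[j]) sum += wA[i++] * wB[j++];
        else if (idsA[i] < idsB[j]) i++;
        else j++;
    }
    return sum;
}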
If only one entity is given in advance, and you are more interested in a ranking-like scenario, Lucene may be worth a try.
The right place to look would be
org.apache.lucene.search.Similarity
You should be able to adapt it to your needs and set your version as the default with
setDefault(Similarity similarity)
I would be careful with expectations for speed gains (compared with iterating through all entities), however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first a boolean stage ("are all of the AND terms contained? is any of the OR terms?"), then scoring whatever passes. While for tf.idf you lose nothing along the way, for other scoring functions you might.
For more general approaches to efficient approximate nearest neighbor search, it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
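To give a flavour of it, here is a minimal random-hyperplane LSH sketch for cosine similarity (a toy, not a library: one signature per vector, dense vectors, at most 32 bits):

import java.util.Random;

// Vectors with high cosine similarity tend to agree on most signature bits,
// so hashing signatures into buckets groups likely neighbours together.
final class CosineLsh {
    private final double[][] planes; // one random hyperplane per signature bit

    CosineLsh(int bits, int dim, long seed) {
        Random rnd = new Random(seed);
        planes = new double[bits][dim];
        for (double[] p : planes)
            for (int d = 0; d < dim; d++) p[d] = rnd.nextGaussian();
    }

    int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int d = 0; d < v.length; d++) dot += planes[b][d] * v[d];
            if (dot >= 0) sig |= 1 << b; // which side of hyperplane b?
        }
        return sig;
    }
}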

Related

R-Tree and Quadtree Comparison

I want to compare the R-tree and the Quadtree for geospatial data. While there is literature out there, I struggle to find documents that cover a really basic comparison. So I decided to ask this question.
In my opinion, the R-tree has the advantage of being balanced, and the tree has no empty leaves.
As a disadvantage, basic operations like insert or delete could result in restructuring the whole index.
The Quadtree is the opposite: it is not balanced and has empty leaves, but it does not need to be restructured.
So, as a conclusion, I would say that the R-tree needs less memory and is faster for searching because of its minimal height.
The Quadtree is better when there are many update operations, but the resulting tree could be unbalanced.
Are these points correct in your opinion?
Are there any good documents out there that cover this topic?
Goodbye, Andre
Here's a paper which has a pretty nice comparison of Quadtrees and R-trees:
Quadtree and R-tree Indexes in Oracle Spatial: A Comparison using GIS Data
Some differences:
Quadtrees require fine-tuning by choosing an appropriate tiling level in order to optimize performance. No specific tuning is required for R-trees.
A Quadtree can be implemented on top of an existing B-tree; an R-tree cannot.
Quadtree indexes are created faster than R-tree indexes.
R-trees are much faster than Quadtrees for nearest-neighbour queries.
R-trees are substantially faster than Quadtrees for window queries, like "inside", "contains", "covers", etc.
"restructuring the whole index". No. Restructuring the R-tree is restricted to a single path, not the "whole" index.
It works similar to the B-tree, actually.
Consider implementing both, and doing some benchmarks yourself, to really know how they perform. Don't only use theory.
On uniformly distributed data with a high change frequency, quadtrees will usually work better. On disk, the R-tree has clear advantages.
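To make the update-cost point concrete, here is a toy point-quadtree insert (hypothetical code, not a library; it assumes points are distinct enough that a cell can always be split). Note how an insert only ever touches the single root-to-leaf path, and a full leaf splits locally without rebalancing anything else:

// Each node covers a square and is either a leaf holding up to CAP points
// or an internal node with four children. Inserts never rebalance.
final class QuadTree {
    static final int CAP = 4;
    final double x, y, size;            // lower-left corner, edge length
    double[][] pts = new double[CAP][]; // leaf storage
    int n = 0;
    QuadTree[] kids;                    // null while this node is a leaf

    QuadTree(double x, double y, double size) { this.x = x; this.y = y; this.size = size; }

    void insert(double px, double py) {
        if (kids == null) {
            if (n < CAP) { pts[n++] = new double[]{px, py}; return; }
            split(); // local operation: only this leaf changes
        }
        child(px, py).insert(px, py); // descend one path; siblings untouched
    }

    private void split() {
        double h = size / 2;
        kids = new QuadTree[]{ new QuadTree(x, y, h), new QuadTree(x + h, y, h),
                               new QuadTree(x, y + h, h), new QuadTree(x + h, y + h, h) };
        for (int i = 0; i < n; i++) child(pts[i][0], pts[i][1]).insert(pts[i][0], pts[i][1]);
        pts = null;
        n = 0;
    }

    private QuadTree child(double px, double py) {
        double h = size / 2;
        return kids[(px < x + h ? 0 : 1) + (py < y + h ? 0 : 2)];
    }
}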

Titan vertex centric indices vs Neo4j labels

I was trying to make a comparison between these two technologies when approaching this problem, and I was wondering if any of you already have some experience dealing with either or both of them?
I am mainly interested in performance numbers when dealing with similar use cases.
The difference between the two concepts is the difference between global and local indexing.
As I understand it, Neo4j vertex labels allow you to break up your index space by "categories" of vertices. In this way, an O(log(|V|)) lookup becomes an O(log(|V|/c)) lookup, where c is the number of categories/labels you have over your vertex set and (the equation) assumes an equal number of vertices in each category. As such, vertex labels aid global index calls, as this is a function of V.
Next, Titan's vertex-centric indices sort and index the incident edges of a vertex. The cost to find a particular edge incident to a vertex by its label/properties is O(log(inc(v))), where inc(v) is the size of the incident edge set of vertex v. As such, vertex-centric indices are local indices, as this is a function of v.
As I understand it, Neo4j does not support vertex-centric indices. You currently see this concept in Titan, OrientDB, and TinkerGraph (…and RDF stores sort in this manner as well, via SPOG pairings). Next, all known graph databases support global indices, though only a few (I believe only Neo4j and OrientDB) support partitioning the vertex set via the concept of a label.
Again, assuming my assumptions are correct about the use of vertex labels in Neo4j, we are talking about two different use cases — global vs. local indexing. From the perspective of the supernode problem, global indices do not quell the issue of traversing through a large vertex, while this is the sole purpose of the local vertex-centric indices.
You can read about the supernode problem and vertex-centric indices here:
http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/
Agreeing with everything Marko said, one could take it further and argue that in the graph database world local indexes can (and even should) substitute global ones. In my opinion, the single greatest advantage of a graph data model is that it lets you encode your data model into the graph topology, gaining qualitative advantages in terms of flexibility, ease of evolution and performance. With this in mind, I'd argue that labels in Neo4j actually detract from all this; reifying a label into a node with adjacent edges pointing to the source having that label is much more in line with the "schema is the graph" philosophy.
Of course, if your engine lacks local indexes we are back at the supernode problem. But if you do have them (something which I'd say should be a requirement for something to be called a graph database), you can easily transform your label into a node L, and create relationships pointing to that node for those vertices which you want labeled with L
v -[L]-> L
meaning that v has label L. Now if you want this in Titan to behave like a Neo4j label, just make the -[L]-> relation "manyToOne" (see Titan cardinality constraints) and create a vertex-centric index. This pattern lets you get everything that you could with labels and much more (see the sketch after the list below); you can
effectively use this as a namespace for properties relating to that label
sort your elements inside one label
nest labels easily without losing performance (just use a composite key)
separate the declaration of a label L from how elements labeled with it are accessed
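A rough sketch of that pattern, written against the Titan 1.0-era management API in the style of the Titan docs (method names from memory; the "since" sort key is just an example, so check your Titan version):

TitanGraph graph = TitanFactory.open("conf/titan.properties");
TitanManagement mgmt = graph.openManagement();

// the -[labeledAs]-> edge plays the role of a Neo4j label;
// MANY2ONE means each vertex carries at most one such edge
PropertyKey since = mgmt.makePropertyKey("since").dataType(Long.class).make();
EdgeLabel labeledAs = mgmt.makeEdgeLabel("labeledAs")
        .multiplicity(Multiplicity.MANY2ONE).make();

// vertex-centric index: incident labeledAs edges stay sorted by "since",
// so lookups on a dense label vertex remain O(log(inc(v)))
mgmt.buildEdgeIndex(labeledAs, "labeledAsBySince", Direction.BOTH, Order.decr, since);
mgmt.commit();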
Labels may afford some design patterns that improve performance by de-densifying the graph. For example, they eliminate the need for type nodes, which can often get quite dense. Labels can optionally be associated with a unique index. Here, the ability to index a property isn't new, but the ability to constrain it uniquely is. If you were previously doing that work in your application, you may see some performance gains by letting the database handle it. (It's certainly much more convenient to do so.) Finally, if you don't assign a unique index to a label, it will still be indexed, in order to help performance for certain kinds of queries (e.g. "give me all of the nodes having label L").
All that said, while labels may help with performance in certain cases, they were introduced more with ease-of-use in mind. We're just getting started with Neo4j 2.1, which specifically addresses dense node performance (something I know you've been waiting for), along with other performance & scalability improvements... including removing (for all practical purposes eliminating) the upper size limits.
Philip

How does Lucene/Solr achieve high performance in multi-field / faceted search?

Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When implemented with a relational database, multi-column indices are not useful for a large number of facets: since facets can be searched in any order, any specific ordered multi-column index has a low chance of being used, and creating all possible orderings of indices is infeasible.
Solr is advertised to cope well with the faceted search task, which, if I understand correctly, must be connected with Lucene (supposedly) performing well on multi-field queries (where the fields of a document correspond to the facets of an object).
Question
The inverted index of Lucene can be stored in a relational database, and naturally taking the intersections of the matching documents can also be trivially achieved with RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?
Note: My first guess would be that Lucene would use some space partitioning method for partitioning a vector space built from the document fields as dimensions, but as I understand Lucene is not purely vector space based.
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these is faster than an RDBMS.
Enum faceting. The result of a query is a bit vector where the ith bit is 1 if the ith document was a match. The facet is also a bit vector, so the intersection is just a bitwise AND (see the BitSet sketch below). I don't think this is a novel approach, and most RDBMSs probably support it.
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:
select facet, count(*) from field_cache
where docId in query_results
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
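Here is the enum-faceting sketch promised above, with java.util.BitSet standing in for Lucene's internal bit vectors (document ids and the facet value are made up):

import java.util.BitSet;

public class EnumFacetSketch {
    public static void main(String[] args) {
        // bit i is set <=> document i matched the query
        BitSet queryResults = new BitSet();
        queryResults.set(1); queryResults.set(3); queryResults.set(5);

        // bit i is set <=> document i has the facet value (say, color=red)
        BitSet facet = new BitSet();
        facet.set(3); facet.set(4); facet.set(5);

        // facet count = population count of the bitwise AND
        BitSet hits = (BitSet) queryResults.clone();
        hits.and(facet);
        System.out.println("color=red: " + hits.cardinality()); // prints 2
    }
}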
Multi-term search
This is where Lucene shines. The full story of why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene performance, or the papers linked therein.
An explanatory post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It’s actually a hybrid approach – to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted, instead the traditional set intersection logic is used to get the counts.
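A sketch of the un-inverting idea in plain Java (the tiny dataset is made up): rather than intersecting one posting list per facet value, you walk the matching documents once and bump a counter per term ordinal stored for each document.

public class UnInvertedSketch {
    public static void main(String[] args) {
        // un-inverted field: for each docId, the ordinals of its terms
        int[][] termOrdsPerDoc = {
            {0, 2}, // doc 0 contains terms 0 and 2
            {1},    // doc 1 contains term 1
            {0, 1}, // doc 2 contains terms 0 and 1
        };
        String[] termByOrd = {"red", "green", "blue"};

        int[] matchingDocs = {0, 2}; // result of the base query
        int[] counts = new int[termByOrd.length];

        // one pass over the matches
        for (int doc : matchingDocs)
            for (int ord : termOrdsPerDoc[doc])
                counts[ord]++;

        for (int ord = 0; ord < counts.length; ord++)
            System.out.println(termByOrd[ord] + ": " + counts[ord]);
        // prints red: 2, green: 1, blue: 1
    }
}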

Sphinx/Solr/Lucene/Elastic Relevancy

We have an extremely large database of 30+ Million products, and need to query them to create search results and ad displays thousands of times a second. We have been looking into Sphinx, Solr, Lucene, and Elastic as options to perform these constant massive searches.
Here's what we need to do: take keywords and run them through the database to find the products that match most closely. We're going to be using our OWN algorithm to decide which products are most related, in order to target our advertisements, but we know that these engines already have their own relevancy algorithms.
So, our question is how can we use our own algorithms on top of the engine's, efficiently. Is it possible to add them to the engines themselves as a module of some sort? Or would we have to rewrite the engine's relevancy code? I suppose we could implement the algorithm from the application by executing multiple queries, but this would really kill efficiency.
Also, we'd like to know which search solution would work best for us. Right now we're leaning towards Sphinx, but we're really not sure.
Also, would you recommend running these engines over MySQL, or would it be better to run them over some type of key-value store like Cassandra? Keep in mind there are 30 million records, and that number is likely to double as we move along.
Thanks for your responses!
I can't give you an entire answer, as I haven't used all the products, but I can say some things which might help.
Lucene/Solr uses a vector space model. I'm not certain what you mean by using your "own" algorithm, but if it gets too far away from the notion of tf/idf (say, by using a neural net) you're going to have difficulties fitting it into Lucene. If by your own algorithm you just mean you want to weight certain terms more heavily than others, that will fit in fine. Basically, Lucene stores information about how important a term is to a document. If you want to redefine the calculation of how important a term is, that's easy to do. If you want to get away from the whole notion of a term's importance to a document, that's going to be a pain.
Lucene (and as a result Solr) stores things in its own custom format. You don't need to use a database. 30 million records is not a remarkably large Lucene index (depending, of course, on how big each record is). If you do want to use a DB, use Hadoop.
In general, you will want to use Solr instead of Lucene.
I have found it very easy to modify Lucene. But as my first point said, if you want to use an algorithm that's not based on some notion of a term's importance to a document, I don't think Lucene will be the way to go.
I actually did something similar with Solr. I can't comment on the details, but basically the proprietary analysis/relevance step generated a series of search terms with associated boosts and fed them to Solr. I think this can be done with any search engine (they all support some sort of boosting).
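For illustration, feeding pre-boosted terms to Solr from Java can look roughly like this with the SolrJ client (field names, boost values, and the core URL are made up; the caret is standard Lucene/Solr boost syntax):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoostedSearchSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

        // your own relevance step picks the per-term boosts; the standard
        // ^boost query syntax carries them into the engine's scoring
        SolrQuery q = new SolrQuery("title:(laptop^4.0 ultrabook^2.5) brand:acme^1.3");
        q.setRows(20);

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits");
        solr.close();
    }
}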
Ultimately it comes down to what your particular analysis requires.

Lucene / Lucene.NET - Document.SetBoost() values?

I know it takes in a float, but what are some typical values for various levels of boosting within a result?
For example:
If I wanted to boost a document's weighting by 10%, should I set it to 1.1?
And for 20%, to 1.2?
What happens if I start setting boosts to values like 75.0? or 500.0?
Please see the Lucene Similarity Documentation for the formula. In principle, all other factors being equal, setting a document's boost to 1.1 will indeed give it a score that is 10% higher as compared to an identical document with a boost of 1.0. If you have a set of documents that should be intrinsically preferred in searches, this may be a good idea. Note that document boost is an indexing-time attribute, making it impossible to change the document's boost without reindexing it.
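A minimal index-time sketch, assuming the classic Lucene 2.x/3.x-era API where Document.setBoost exists (later versions moved boosting to fields and then to query time):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// a document that should be intrinsically preferred in searches
Document doc = new Document();
doc.add(new Field("title", "Premium listing", Field.Store.YES, Field.Index.ANALYZED));

// all else being equal, scores ~10% higher than an identical doc with boost 1.0
doc.setBoost(1.1f);

// writer.addDocument(doc); // IndexWriter configured elsewhere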
There are other important factors in scoring - including term match scores, norms etc.
See Debugging Relevance Issues in Search for details.
Adding to what Yuval has said: this value is a function of the field boost and the document boost. The boost values are encoded in a single byte, so precision might be lost while storing this value. Debugging with Searcher.Explain() would help you get the right amount of boost.
If you want the boost value to be preserved (which is useful, for example, when you want to recreate an index from the current index), you may add it in a stored field.
The important thing to remember about boosting is not to approach it in isolation; you need to consider it as part of a global strategy. Make a list of each criterion used to affect relevancy, and then order those criteria. Define a relationship between each of those criteria. Are you regularly re-indexing, or are you just adding new documents? If you are regularly re-indexing, you can afford to tune your document boost criteria; if not, you need to think it through thoroughly beforehand.