I have the following use case:
I want to search my Lucene documents within a circle of radius x km around a given user latitude/longitude.
I also want to sort the documents by distance.
I also need the distance values later on to display to the user.
Which spatial strategy would be best for me, considering performance and without indexing anything extra?
Given your requirements, I think the best choice is PointVectorStrategy, which is the simplest one and also satisfies your conditions. Its documentation describes it as follows:
Simple SpatialStrategy which represents Points in two numeric fields.
The Strategy's best feature is decent distance sort.
Characteristics:
Only indexes points; just one per field value.
Can query by a rectangle or circle.
SpatialOperation.Intersects and SpatialOperation.IsWithin is supported.
Requires DocValues for SpatialStrategy.makeDistanceValueSource(org.locationtech.spatial4j.shape.Point)
and for searching with a Circle.
Yes, it requires you to index DocValues, but if I understand correctly, none of the spatial strategies will provide the needed functionality for free.
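To make the three requirements concrete (filter by circle, sort by distance, keep the distance for display), here is a minimal Python sketch of what such a query computes. It is not the Lucene API: the function names, the `(doc_id, lat, lon)` tuples and the haversine formula on a spherical Earth are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius_sorted(docs, user_lat, user_lon, radius_km):
    """Return (distance_km, doc_id) pairs inside the circle, nearest first.

    `docs` is a hypothetical iterable of (doc_id, lat, lon) tuples.
    """
    hits = []
    for doc_id, lat, lon in docs:
        d = haversine_km(user_lat, user_lon, lat, lon)
        if d <= radius_km:
            hits.append((d, doc_id))  # keep the distance so it can be displayed later
    hits.sort()
    return hits
```

A brute-force scan like this is only meant to show the result shape; in Lucene the strategy's query does the filtering and the distance value source supplies the sort key and the value to display.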
I was trying to make a comparison between these two technologies when approaching this and I was wondering if any of you already have some experience dealing with any or both of them?
I am mainly interested in performance numbers when dealing with similar use cases.
The difference between the two concepts is the difference between global and local indexing.
As I understand it, Neo4j vertex labels allow you to break up your index space by "categories" of vertices. In this way, an O(log(|V|)) lookup becomes an O(log(|V|/c)) lookup, where c is the number of categories/labels you have over your vertex set and the equation assumes an equal number of vertices in each category. As such, vertex labels aid in global index calls, as this is a function of V.
Next, Titan's vertex-centric indices sort and index the incident edges of a vertex. The cost to find a particular edge by its label/properties incident to a vertex is O(log(inc(v))), where inc(v) is the size of the incident edge set to vertex v. As such, vertex-centric indices are local indices as this is a function of v.
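A toy Python sketch of the local-index idea, as I understand it: each vertex keeps its incident edges sorted by (label, weight), so finding an edge is a binary search over that vertex's own edges rather than over the whole graph. The class and method names are hypothetical and this is not how Titan is implemented internally.

```python
import bisect


class VertexCentricIndex:
    """Toy local index: each vertex keeps its incident edges sorted by (label, weight)."""

    def __init__(self):
        self._incident = {}  # vertex -> sorted list of (label, weight, neighbour)

    def add_edge(self, src, label, weight, dst):
        # insort keeps the per-vertex list ordered, like a sorted edge row
        bisect.insort(self._incident.setdefault(src, []), (label, weight, dst))

    def neighbours(self, src, label, min_weight, max_weight):
        """Neighbours of src over `label` edges with min_weight <= weight <= max_weight."""
        row = self._incident.get(src, [])
        # Binary search: O(log inc(v)) to locate the first candidate edge
        i = bisect.bisect_left(row, (label, min_weight))
        found = []
        while i < len(row) and row[i][0] == label and row[i][1] <= max_weight:
            found.append(row[i][2])
            i += 1
        return found
```

The lookup cost depends only on the size of the incident edge set of `src`, which is why this kind of index helps with a supernode while a global index over all vertices does not.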
As I understand it, Neo4j does not support vertex-centric indices. You see this concept currently in Titan, OrientDB, and TinkerGraph (…and RDF stores sort in this manner as well -- via spog pairings). Next, all known graph databases support global indices, though I believe only Neo4j and OrientDB support a vertex set partition via the concept of a label.
Again, assuming my assumptions are correct about the use of vertex labels in Neo4j, we are talking about two different use cases — global vs. local indexing. From the perspective of the supernode problem, global indices do not quell the issue of traversing through a large vertex, while this is the sole purpose of the local vertex-centric indices.
You can read about the supernode problem and vertex-centric indices here:
http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/
Agreeing with everything Marko said, one could take it further and argue that in the graph database world local indexes can (and even should) substitute global ones. In my opinion, the single greatest advantage of a graph data model is that it lets you encode your data model into the graph topology, gaining qualitative advantages in terms of flexibility, ease of evolution and performance. With this in mind, I'd argue that labels in Neo4j actually detract from all this; reifying a label into a node with adjacent edges pointing to the source having that label is much more in line with the "schema is the graph" philosophy.
Of course, if your engine lacks local indexes we are back at the supernode problem. But if you do have them (something which I'd say should be a requirement for something to be called a graph database), you can easily transform your label into a node L, and create relationships pointing to that node for those vertices which you want labeled with L
v -[L]-> L
meaning that v has label L. Now if you want this in Titan to behave like a Neo4j label, just make the -[L]-> relation to be "manyToOne" (see Titan cardinality constraints) and create a vertex-centric index. This pattern lets you get everything that you could with labels and much more; you can
effectively use this as a namespace for properties relating to that label
sort your elements inside one label
nest labels easily without losing performance (just use a composite key)
separate the declaration of a label L from how elements labeled with it are accessed
Labels may afford some design patterns that improve performance by de-densifying the graph. For example: they eliminate the need for type nodes, which can often get quite dense. Labels can optionally be associated with a unique index. Here, the ability to index a property isn't new, but the ability to constrain it uniquely is. If you were previously doing work in your application, you may experience some performance gains by letting the database handle this. (It's certainly much more convenient to do so.) Finally, if you don't assign a unique index to a label, it will still be indexed, in order to help performance for certain kinds of queries (e.g. "give me all of the nodes having label ")
All that said, while labels may help with performance in certain cases, they were introduced more with ease-of-use in mind. We're just getting started with Neo4j 2.1, which specifically addresses dense node performance (something I know you've been waiting for), along with other performance & scalability improvements... including removing (for all practical purposes eliminating) the upper size limits.
Philip
My cousin has created a semantic search engine, and he claims that his search engine is the most accurate.
I've seen many semantic search engines and they all look the same to me, because they are not designed to give you results based on the keyword you type.
So if you are creating a semantic search engine, how do you determine the accuracy of its results?
Actually sarnold's suggestion is not far off the mark.
What you would typically do is to take a whole bunch of people and have them try out a bunch of standard queries. Or if you wanted to make the experiment fairer you might let each user pick their own queries to avoid any accusation of bias (because you could pick standard queries you knew your engine was good at answering).
For each query the user would look through the first 10 or so results and say whether they thought each result was relevant or not (you may want to have users score on a scale rather than just yes/no).
Then, for each of the queries, you can calculate accuracy scores. Depending on exactly how you set up the experiment, Precision and Recall may be the most appropriate measures, though these rely on having a known expected answer, which you may not necessarily have. It may be simpler and more appropriate to calculate a simple percentage accuracy.
To determine whether your search engine was better than your competitors you'd have the same people perform the same queries on those search engines scoring in the same way. Having done this you can then calculate and compare the scores for the search engines against your own.
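As a rough sketch of the scoring step, the "simple percentage accuracy" over the first few results is usually called precision at k, and it needs only the users' relevant/not-relevant marks (recall would additionally need the full set of relevant documents). The function names and the list-of-booleans input format below are assumptions for illustration.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the first k results judged relevant.

    `judgements` is a list of booleans in rank order (True = relevant).
    If fewer than k results came back, only those returned are counted.
    """
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0


def mean_precision(per_query_judgements, k=10):
    """Average precision-at-k over all queries run against one engine."""
    scores = [precision_at_k(j, k) for j in per_query_judgements]
    return sum(scores) / len(scores) if scores else 0.0
```

You would compute `mean_precision` once per search engine from the same users and the same queries, then compare the numbers.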
I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but cannot see how I can associate with a document my own custom feature vector with precomputed weights. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically a zset is a sorted set of key-value pairs.
So you can have one sorted set per feature, where each feature maps to its (docid, score) pairs:
feature->[ { docid, score }, {docid, score} ..]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
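To show the data layout without needing a redis server, here is a small in-memory Python sketch that mimics one sorted set per feature. It is not redis and not my memory-mapped solution; the class and method names are made up, and `weighted_union` only approximates what ZUNIONSTORE with WEIGHTS and the default SUM aggregation would give you.

```python
from collections import defaultdict


class FeatureIndex:
    """In-memory inverted index: feature -> {doc_id: weight}."""

    def __init__(self):
        self._postings = defaultdict(dict)

    def zadd(self, feature, score, doc_id):
        # Same argument order as the redis command: ZADD key score member
        self._postings[feature][doc_id] = score

    def weighted_union(self, query):
        """Sum query_weight * stored_weight per document, best score first.

        `query` maps feature -> query weight, so the total per document
        is the dot product of the query with that document's features.
        """
        totals = defaultdict(float)
        for feature, q_weight in query.items():
            for doc_id, score in self._postings.get(feature, {}).items():
                totals[doc_id] += q_weight * score
        # Highest score first; ties broken by doc id for a stable order
        return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Only documents sharing at least one feature with the query are touched, which is the whole point of going through the inverted index.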
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index, and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just look up the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
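For the pair-wise case, a minimal sketch of a sparse representation is a dict from feature to weight, with cosine similarity computed over the shared features only. The function name and the choice of cosine are assumptions; any other distance function over the same dicts would follow the same pattern.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)
```

The cost is proportional to the number of non-zero features, not to the size of the feature vocabulary.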
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look at would be
org.apache.lucene.search.Similarity
you should be able to adapt it to your needs and set your version as default with
setDefault(Similarity similarity)
I would be careful with expectations for speed gains (w.r.t. iterating through all) however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first boolean matching ("all of the AND terms contained? any of the OR terms?"), then scoring of what passes. While for tf.idf you lose nothing on the way, for other scoring functions you might.
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
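To give a flavour of LSH for cosine-like similarity, here is a minimal random-hyperplane sketch over sparse {feature: weight} dicts. The function names, the fixed feature vocabulary and the seed are assumptions for illustration; a real setup would bucket entities by (parts of) the signature and only compare candidates that share a bucket.

```python
import random


def make_hyperplanes(features, n_bits, seed=0):
    """One random Gaussian hyperplane per signature bit, over a known feature vocabulary."""
    rng = random.Random(seed)
    return [{f: rng.gauss(0.0, 1.0) for f in features} for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """Bit i is 1 if the sparse vector lies on the non-negative side of hyperplane i."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(w * plane.get(f, 0.0) for f, w in vec.items())
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing signature bits; small means a small angle is likely."""
    return bin(a ^ b).count("1")
```

Vectors pointing in the same direction get identical signatures regardless of length, and the expected Hamming distance grows with the angle between two vectors.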
Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When this is implemented with a relational database, multi-field indices are not useful for a large number of facets: facets can be searched in any order, so any specific ordered multi-field index has a low chance of being used, and creating indices for all possible orderings is infeasible.
Solr is advertised to cope well with the faceted search task, which, if I understand correctly, has to be connected with Lucene (supposedly) performing well on multi-field queries (where fields of a document relate to facets of an object).
Question
The inverted index of Lucene can be stored in a relational database, and naturally taking the intersections of the matching documents can also be trivially achieved with RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?
Note: My first guess would be that Lucene would use some space partitioning method for partitioning a vector space built from the document fields as dimensions, but as I understand Lucene is not purely vector space based.
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these is faster than an RDBMS.
Enum faceting. Results of a query are a bit vector where the ith bit is 1 if the ith document was a match. The facet is also a bit vector, so intersection is just a bitwise AND. I don't think this is a novel approach, and most RDBMS's probably support it.
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:
select facet, count(*) from field_cache
where docId in query_results
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
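Both approaches can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's code: Python integers stand in for the bit vectors, a plain dict stands in for the field cache, and all names are hypothetical.

```python
from collections import Counter


def bitset(doc_ids):
    """Bit vector as a Python int: bit i is set if document i is a member."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits


def enum_facet_counts(query_bits, facet_bitsets):
    """Enum faceting: one bitwise AND plus a population count per facet value."""
    return {value: bin(query_bits & bits).count("1")
            for value, bits in facet_bitsets.items()}


def field_cache_counts(query_doc_ids, field_cache):
    """Field cache faceting: look up each matching doc's value and count (the GROUP BY)."""
    return Counter(field_cache[d] for d in query_doc_ids)
```

The first variant costs one pass per facet value, the second one lookup per matching document, so which is cheaper depends on how many distinct values the facet has versus how many documents matched.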
Multi-term search
This is where Lucene shines. Why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.
An explaining post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It’s actually a hybrid approach – to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted, instead the traditional set intersection logic is used to get the counts.
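A simplified Python sketch of that hybrid, under my reading of the quoted description: terms above the 5% threshold keep their posting sets and are counted by set intersection, all other terms are un-inverted into a document-to-terms map. The function names and data structures are assumptions and ignore the memory-saving encodings the real implementation uses.

```python
from collections import Counter


def build_uninverted(postings, num_docs, threshold=0.05):
    """Split terms: frequent ones keep posting sets, the rest go into doc -> terms."""
    big_terms = {}   # term -> set of doc ids (handled by set intersection)
    doc_terms = {}   # doc id -> list of terms (the un-inverted part)
    for term, docs in postings.items():
        if len(docs) > threshold * num_docs:
            big_terms[term] = set(docs)
        else:
            for d in docs:
                doc_terms.setdefault(d, []).append(term)
    return big_terms, doc_terms


def facet_counts(matches, big_terms, doc_terms):
    """Count facet terms over the set of matching doc ids using both structures."""
    counts = Counter()
    for d in matches:
        counts.update(doc_terms.get(d, ()))      # quick per-document term lookup
    for term, docs in big_terms.items():
        n = len(docs & matches)                  # traditional set intersection
        if n:
            counts[term] = n
    return counts
```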
I know it takes in a float, but what are some typical values for various levels of boosting within a result?
For example:
If I wanted to boost a document's weighting by 10% then I should set it to 1.1?
For 20% then 1.2?
What happens if I start setting boosts to values like 75.0? or 500.0?
Please see the Lucene Similarity Documentation for the formula. In principle, all other factors being equal, setting a document's boost to 1.1 will indeed give it a score that is 10% higher as compared to an identical document with a boost of 1.0. If you have a set of documents that should be intrinsically preferred in searches, this may be a good idea. Note that document boost is an indexing-time attribute, making it impossible to change the document's boost without reindexing it.
There are other important factors in scoring - including term match scores, norms etc.
See Debugging Relevance Issues in Search for details.
Adding to what Yuval has said: this value is a function of the field boost and the document boost. The boost values are encoded in a single byte, so precision might be lost when storing this value. Debugging with Searcher.Explain() would help you see the boost that was actually applied.
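To see how much precision a single byte can cost, here is a Python sketch of an 8-bit float encoding. It is modelled on the scheme I believe older Lucene versions used for norms (`SmallFloat.floatToByte315`); treat the exact layout as an assumption and check the source of your version. Also note that what gets encoded is the combined norm (boosts multiplied with length normalisation), not the boost alone.

```python
import struct


def float_to_byte(f):
    """Encode a float into one byte, truncating to roughly four steps per power of two."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]
    small = bits >> (24 - 3)           # keep sign, exponent and the top mantissa bits
    zero_point = (63 - 15) << 3        # offset of the smallest representable exponent
    if small <= zero_point:
        return 0 if bits <= 0 else 1   # zero/negative -> 0, tiny positive -> smallest code
    if small >= zero_point + 0x100:
        return 255                     # overflow clamps to the largest code
    return small - zero_point


def byte_to_float(b):
    """Decode a byte produced by float_to_byte back into a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Under this encoding 1.1 and 1.2 both round-trip to 1.0, 75.0 comes back as 64.0 and 500.0 as 448.0, so small boosts can vanish entirely and large ones are only approximate.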
If you want the boost value to be preserved (it's useful, for example, when you want to recreate index from current index), you may add it in a stored field.
The important thing to remember about boosting is not to approach it in isolation; you need to consider it as part of a global strategy. Make a list of each criterion used to affect relevancy, then order those criteria and define a relationship between each of them. Are you regularly re-indexing, or are you just adding new documents? If you are regularly re-indexing, you can afford to tune your document boost criteria; if not, you need to think it through thoroughly beforehand.