What is the correct way to implement a custom ranking algorithm for Solr/Lucene?
I read about Zvents implementing a Distance Weighting ranking system for documents which correspond to events in a specific geographic area (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Zvents).
I would like to do something similar: I index ads in different cities and I would like to boost the relevance of nearest ads given a specific location.
Local Lucene is a project meant to add local search to Lucene. Basically, you add spatial coordinates to the index fields. You then have to decide, based on your index structure, whether it is better to first search according to textual matches and then filter by geographic location, or the other way around. Lucene in Action has an example of a spatial result filter. The forthcoming second edition will probably have more in that direction. See also the LocalSolr wiki page.
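For reference, later Solr releases (3.1 and up) expose this kind of distance weighting directly through the geodist() function, so a keyword query can be boosted by proximity along these lines (the ad_text and location_p field names and the recip constants are illustrative, not prescriptive):

q={!boost b=recip(geodist(),2,200,20)}ad_text:apartment&sfield=location_p&pt=45.15,-93.85

Here recip(x,m,a,b) evaluates to a/(m*x+b), so the boost decays smoothly as the distance in kilometers grows.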
I have the following use case:
I want to be able to search my lucene documents within a circle of radius x km from a given user lat/long.
I also want to sort the documents by distance.
I also need the distance values later on to display to the user.
Which spatial strategy would be best for me, without indexing anything extra and considering performance?
According to your requirements, I think the best choice could be the PointVectorStrategy, which is the simplest one and also satisfies your conditions:
Simple SpatialStrategy which represents Points in two numeric fields.
The Strategy's best feature is decent distance sort.
Characteristics:
Only indexes points; just one per field value.
Can query by a rectangle or circle.
SpatialOperation.Intersects and SpatialOperation.IsWithin are supported.
Requires DocValues for SpatialStrategy.makeDistanceValueSource(org.locationtech.spatial4j.shape.Point) and for searching with a Circle.
Yes, it will require you to have DocValues indexed, but if I understand correctly, none of the spatial strategies will provide the needed functionality for free.
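For concreteness, here is a minimal sketch of that strategy in plain Lucene (assuming Lucene 7.x with spatial-extras and spatial4j; the "location" field name, coordinates, and radius are placeholders, and method signatures differ slightly between versions):

import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.spatial.query.SpatialArgs;
import org.apache.lucene.spatial.query.SpatialOperation;
import org.apache.lucene.spatial.vector.PointVectorStrategy;
import org.locationtech.spatial4j.context.SpatialContext;
import org.locationtech.spatial4j.distance.DistanceUtils;
import org.locationtech.spatial4j.shape.Point;

public class RadiusSearchSketch {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;
        // Strategy that stores each point as two numeric fields (plus docvalues) under the "location" prefix.
        PointVectorStrategy strategy = PointVectorStrategy.newInstance(ctx, "location");

        double userLat = 52.52, userLon = 13.40, radiusKm = 10;
        Point center = ctx.getShapeFactory().pointXY(userLon, userLat);
        double radiusDeg = DistanceUtils.dist2Degrees(radiusKm, DistanceUtils.EARTH_MEAN_RADIUS_KM);

        // Filter: only documents whose point falls inside the x-km circle.
        SpatialArgs spatialArgs = new SpatialArgs(SpatialOperation.Intersects,
                ctx.getShapeFactory().circle(center, radiusDeg));
        Query withinRadius = strategy.makeQuery(spatialArgs);

        // Sort by distance; the DEG_TO_KM multiplier makes the values come back in km,
        // which is also what you would show to the user.
        DoubleValuesSource distanceKm =
                strategy.makeDistanceValueSource(center, DistanceUtils.DEG_TO_KM);
        Sort nearestFirst = new Sort(distanceKm.getSortField(false));
        // indexSearcher.search(withinRadius, 10, nearestFirst) then returns nearest-first hits.
    }
}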
I want to be able to determine if a GPS location is in an inhabited or uninhabited zone.
I have tried several reverse geocoding APIs like Nominatim, but failed to get good results. It always returns the nearest possible address, even when I selected a location in the middle of a forest.
Is there any way to determine this with reasonable accuracy? Are there any databases or web services for this?
If you have to calculate that yourself, then the interesting things start:
The information whether or not a region is inhabited is stored in digital maps in the "Land_Use" layer. There are values for Forest, Water, Industry, Cemetery, etc.
You would have to import these Land_Use polygons into a spatial DB (e.g., PostgreSQL with PostGIS).
Such a spatial DB provides fast geo indexes for searching only the relevant polygons.
Some countries may also fit in main memory, but then you need some kind of geospatial index, like a quad-tree or k-d tree, to store the polygons.
Once you have imported the polygons, it is a simple "point in polygon" query, or "polygons within radius r". The type of the polygon denotes the land use.
OpenStreetMap provides these polygons for free.
Otherwise you have to buy them from TomTom or probably NavTeq (Nokia Maps). But this only makes sense for major companies.
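As a rough illustration of that "point in polygon" query, assuming the OSM land-use polygons have been imported into PostGIS as a table landuse(type, geom) in SRID 4326 (table, column, and connection details here are assumptions):

import java.sql.*;

public class LandUseLookupSketch {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT type FROM landuse " +
                     "WHERE ST_Contains(geom, ST_SetSRID(ST_MakePoint(?, ?), 4326))";
        try (Connection c = DriverManager.getConnection("jdbc:postgresql://localhost/gis", "gis", "gis");
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setDouble(1, 11.575);   // longitude of the GPS location
            ps.setDouble(2, 48.137);   // latitude of the GPS location
            try (ResultSet rs = ps.executeQuery()) {
                // A row with type 'residential', 'commercial', etc. suggests an inhabited zone;
                // 'forest', 'water', ... (or no row at all) suggests an uninhabited one.
                System.out.println(rs.next() ? rs.getString("type") : "no land-use polygon here");
            }
        }
    }
}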
Since you're using Nominatim, you're getting the coordinates of the nearest address back in the reply.
Since the distance between two coordinates can be calculated, you can just use that to calculate the distance to the closest address found, and from that figure out if you're close to populated areas or not.
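A minimal sketch of that distance check (the haversine formula with a 6371 km earth radius; the 500 m "close to an address" cutoff is an arbitrary assumption, not something Nominatim defines):

public class DistanceCheckSketch {
    // Great-circle distance in km between two lat/long pairs (haversine formula).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * 6371.0 * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Query point vs. the nearest address returned by the reverse geocoder.
        double km = haversineKm(48.137, 11.575, 48.135, 11.582);
        System.out.println(km < 0.5 ? "likely inhabited" : "possibly uninhabited");
    }
}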
I am looking to perform live A/B and controlled side-by-side experiments to help understand how changes affect search quality. I will be testing variables such as boost values and fuzzy queries.
What other metrics are used to determine whether users prefer A vs B? Here are 2 metrics I found online...
In Google Analytics, "% Search Exits" is a metric you can use to measure the quality of your site-search results.
Another way to measure search quality is to measure the number of search result pages the visitor views.
Search quality is not easily measurable. For measuring relevance you need a couple of things:
A competitor to measure relevance against. In your case the different instances of your search engine will be competitors for each other: one instance would run the basic algorithm, another would have fuzzy matching enabled, another both fuzzy and boosting, and so on.
You need to manually rate the results. You can ask your colleagues to rate query/URL pairs for popular queries, and then for the holes (i.e., query/URL pairs not rated) you can use a dynamic ranking function built with a "Learning to Rank" algorithm (http://en.wikipedia.org/wiki/Learning_to_rank). Don't be surprised by that, it's true (please read below for an example from Google/Bing).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. So for each query, generally the top 3 or top 5 query/URL pairs are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), one of the finest and most popular metrics.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article; please go through it.
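To make that concrete, here is a tiny sketch of the computation under the common (2^rel - 1) / log2(rank + 1) formulation; other gain/discount variants exist, and the grades below are made up:

public class NdcgSketch {
    // DCG over judge grades listed in ranked order (rank 1 first).
    static double dcg(int[] grades) {
        double sum = 0;
        for (int i = 0; i < grades.length; i++) {
            sum += (Math.pow(2, grades[i]) - 1) / (Math.log(i + 2) / Math.log(2));  // log2(rank + 1)
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] ratedRanking = {3, 2, 3, 0, 1};   // grades in the order engine A returned the results
        int[] idealRanking = {3, 3, 2, 1, 0};   // the same grades sorted best-first
        System.out.println("NDCG@5 = " + dcg(ratedRanking) / dcg(idealRanking));
    }
}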
As you have mentioned, you can also use click-through rate/data, where you have a kind of wisdom-of-the-crowd algorithm and tweak the relevance based on it. It is a very good way out, but it attracts spamming, so it has to be coupled with a metric such as NDCG, MAP, etc. to solve your relevance problem.
I can provide more details on this if you still need to know more about how the whole thing would work in your case study.
We are using Solr 3.3 with Solr.NET, and we have put a dynamic "location_p" location-type field on our documents; now we need the ability to do spatial searches.
I have got the radius searches (distance from a given point) working like this:
{!geofilt sfield=location_p pt=33.882518712472255,-84.05531775646972 d=1.7}
Now we need the ability to do a polygon query to get all documents with the "location_p" field 'inside' a given set of points (something along the lines of the polygon search capabilities of ElasticSearch).
This is really different from the BBox query filter, as the points of the polygon are not symmetrical but rather arbitrary, based on the user's 'click' points.
Any ideas or suggestions would be appreciated.
As far as I know Solr doesn't currently implement polygon spatial search.
There are a couple of efforts towards implementing this (SOLR-2155, SOLR-2268). Try applying one of these patches, test it, contribute to the project.
There's also Spatial Solr plugin, which implements polygon search but is only compatible with Solr 1.4.
See also http://wiki.apache.org/incubator/SpatialProposal
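For what it's worth, the SOLR-2155 work later evolved into the SpatialRecursivePrefixTreeFieldType that ships with Solr 4.x, where (assuming a field of that type named location_rpt, with the JTS library on the classpath for WKT polygon support) a polygon filter looks roughly like this:

fq={!field f=location_rpt}Intersects(POLYGON((-84.06 33.88, -84.05 33.89, -84.04 33.88, -84.06 33.88)))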
Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When implemented with a relational database, multi-column indexes are not useful for a large number of facets, since facets can be combined in any order: any specific ordered multi-column index has a low chance of being used, and creating indexes for all possible orderings is infeasible.
Solr is advertised to cope well with faceted search, which, if I think correctly, has to be connected to Lucene (supposedly) performing well on multi-field queries (where the fields of a document correspond to the facets of an object).
Question
The inverted index of Lucene can be stored in a relational database, and naturally, taking the intersection of the matching documents can also be achieved trivially with an RDBMS using single-column indexes.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?
Note: My first guess would be that Lucene uses some space-partitioning method to partition a vector space built from the document fields as dimensions, but as I understand it, Lucene is not purely vector-space based.
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these is faster than an RDBMS.
Enum faceting. The results of a query are a bit vector where the ith bit is 1 if the ith document was a match. Each facet value is also a bit vector, so the intersection is just a bitwise AND (see the short sketch below). I don't think this is a novel approach, and most RDBMSs probably support it.
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:
select facet, count(*) from field_cache
where docId in query_results
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
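A minimal illustration of the enum-faceting intersection mentioned above (Lucene uses its own bitset implementations; java.util.BitSet stands in here purely for illustration):

import java.util.BitSet;

public class EnumFacetSketch {
    public static void main(String[] args) {
        BitSet queryResults = new BitSet();
        queryResults.set(1); queryResults.set(4); queryResults.set(7);   // docs matching the query

        BitSet facetRed = new BitSet();
        facetRed.set(2); facetRed.set(4); facetRed.set(7);               // docs with facet value "red"

        BitSet intersection = (BitSet) queryResults.clone();
        intersection.and(facetRed);                                      // bitwise AND
        System.out.println("count(red) = " + intersection.cardinality()); // -> 2
    }
}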
Multi-term search
This is where Lucene shines. Why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.
An explanatory post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It’s actually a hybrid approach – to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted, instead the traditional set intersection logic is used to get the counts.
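A rough sketch of that hybrid counting idea (this illustrates the description above, not Solr's actual UnInvertedField code; the maps, names, and the split between "rare" and "frequent" terms are assumptions):

import java.util.*;

public class HybridFacetCountSketch {
    // matchingDocs and queryBits describe the same query result set: as a doc list and as a bit vector.
    static Map<String, Integer> countFacets(int[] matchingDocs,
                                            Map<Integer, List<String>> docToRareTerms,   // un-inverted field: doc -> its terms
                                            Map<String, BitSet> frequentTermPostings,    // postings kept as sets for common terms
                                            BitSet queryBits) {
        Map<String, Integer> counts = new HashMap<>();
        // Un-inverted path: walk the matching docs and bump each of their (rare) terms.
        for (int doc : matchingDocs) {
            for (String term : docToRareTerms.getOrDefault(doc, Collections.emptyList())) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        // Traditional path: for frequent terms, intersect their posting set with the query's bit vector.
        for (Map.Entry<String, BitSet> e : frequentTermPostings.entrySet()) {
            BitSet and = (BitSet) e.getValue().clone();
            and.and(queryBits);
            counts.merge(e.getKey(), and.cardinality(), Integer::sum);
        }
        return counts;
    }
}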