Need Lucene query optimization advice

I'm working on a web-based job search application using Lucene. Users on my site can search for jobs within a 100-mile radius of, say, "Boston, MA" or any other location.
I also need to show the search results sorted by "relevance" (i.e., the score returned by Lucene) in descending order.
I'm using a third-party API to fetch all the cities within a given radius of a city. This API returns around 864 cities within a 100-mile radius of "Boston, MA".
I'm building the city/state Lucene query using the following logic, which is part of my "BuildNearestCitiesQuery" method.
Here nearestCities is a hashtable returned by the above API. It contains 864 cities, with CityName as key and StateCode as value.
finalQuery is a Lucene BooleanQuery object which contains the other search criteria entered by the user, like skills, keywords, etc.
foreach (string city in nearestCities.Keys)
{
    BooleanQuery tempFinalQuery = finalQuery;
    cityStateQuery = new BooleanQuery();
    queryCity = queryParserCity.Parse(city);
    queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);
    cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST);   // MUST is like an AND
    cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD); // SHOULD is like an OR
}
finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);
I then pass the finalQuery object to Lucene's Search method to get all the jobs within the 100-mile radius:
searcher.Search(finalQuery, collector);
I found out this BuildNearestCitiesQuery method takes a whopping 29 seconds on average to execute, which is obviously unacceptable by any standard for a website. I also found that the statements involving "Parse" take a considerable amount of time to execute compared to the other statements.
A job for a given location is a dynamic attribute, in the sense that a city could have 2 jobs (meeting a particular search criterion) today but zero jobs for the same criterion three days later. So I cannot use any caching here.
Is there any way I can optimize this logic? Or, for that matter, my whole approach/algorithm for finding all jobs within 100 miles using Lucene?
FYI, here is what my Lucene indexing looks like:
doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO, Field.Index.TOKENIZED));
doc.Add(new Field("city", job.City.Trim(), Field.Store.YES, Field.Index.TOKENIZED , Field.TermVector.YES));
doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("citystate", job.City.Trim() + ", " + job.StateCode.Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED , Field.TermVector.YES));
doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED,Field.TermVector.YES));
doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("showAllJobs", "yy", Field.Store.NO, Field.Index.UN_TOKENIZED));
Thanks a ton for reading! I would really appreciate your help on this.
Janis

Not quite sure if I completely understand your code, but when it comes to geospatial search a filter approach might be more appropriate. Maybe this link can give you some ideas - http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html
Maybe you can use Filters for other parts of your query as well. To be honest your query looks quite complex.
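For the city restriction specifically, something like this might be a starting point. A rough, untested sketch, assuming Lucene.Net 2.9-style APIs, the UN_TOKENIZED citystate field from your index, and (per your description) a plain state-code string as the hashtable value:

// Build one exact term per nearby city against the UN_TOKENIZED citystate
// field -- no QueryParser involved. 864 SHOULD clauses stay under the
// default 1024-clause limit.
BooleanQuery cityQuery = new BooleanQuery();
foreach (string city in nearestCities.Keys)
{
    string state = (string)nearestCities[city]; // assumed: value is the state code
    cityQuery.Add(new TermQuery(new Term("citystate", city + ", " + state)),
                  BooleanClause.Occur.SHOULD);
}

// QueryWrapperFilter applies the restriction without affecting scoring;
// CachingWrapperFilter caches the bitset per IndexReader, so it stays
// correct as the index changes and readers are reopened.
Filter cityFilter = new CachingWrapperFilter(new QueryWrapperFilter(cityQuery));
searcher.Search(finalQuery, cityFilter, collector);

Since the cache is per reader rather than per search, it doesn't conflict with your "jobs change daily" constraint.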
--Hardy

Apart from tempFinalQuery being unused and an unnecessary map lookup to get the state, there doesn't seem to be anything too egregious in the code you posted. Apart from the formatting...
If all the time is taken in the Parse methods, posting their code here would make sense.
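That said, for literal, already-known values you may not need QueryParser at all. A hedged sketch against your UN_TOKENIZED citystate field:

// Parse() re-runs the analyzer and the query grammar on every call; an
// exact TermQuery against an untokenized field does neither.
Query cityStateQuery = new TermQuery(new Term("citystate", city + ", " + stateCode));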

I might have missed the point of your question but do you have the possibility of storing latitude and longitude for zip codes? If that is an option, you could then compute the distance between two coordinates providing a much more straightforward scoring metric.

I believe the best approach is to move the nearest-city determination into a search filter. I would also reconsider how you have the fields set up; consider creating one term that combines city and state, which would simplify the query.

I'd suggest:
storing the latitude and longitude of locations as they come in;
when a user enters a city and distance, converting the city to a lat/lon point and the distance to degrees;
doing a single, simple lookup based on numeric lat/lon range comparisons.
You can see an example of how this works in the Geo::Distance Perl module. Take a look at the closest method in the source, which implements this lookup via simple SQL.
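The gist of closest is a cheap bounding box followed by an exact distance check. A hedged C# transcription of the idea (the 69-miles-per-degree figures are the usual approximations, not values taken from the module):

// Rough bounding-box prefilter, as in Geo::Distance's closest():
// one degree of latitude is ~69 miles; a longitude degree shrinks by cos(lat).
double lat = 42.36, lon = -71.06;          // e.g. Boston, MA (approximate)
double radiusMiles = 100.0;

double latDelta = radiusMiles / 69.0;
double lonDelta = radiusMiles / (69.0 * Math.Cos(lat * Math.PI / 180.0));

double minLat = lat - latDelta, maxLat = lat + latDelta;
double minLon = lon - lonDelta, maxLon = lon + lonDelta;

// SELECT ... WHERE lat BETWEEN @minLat AND @maxLat
//              AND lon BETWEEN @minLon AND @maxLon
// then drop any survivors that fail an exact great-circle distance check.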

Agree with the others here that this smells too much. Also, doing a textual search on city names is not always that reliable; there is often a bit of subjectivity between place names (particularly areas within a city, which might themselves be large).
Doing a geospatial query is the way to go. Not knowing the rest of your setup, it is hard to advise. You do have spatial support built into Fluent NHibernate and SQL Server 2008, for example, so you could do such a search very quickly and efficiently. However, your challenge is to get this working within Lucene.
You could possibly do a "first pass" query using spatial support in SQL Server, and then run those results through Lucene?
The other major benefit of doing spatial queries is that you can then easily sort your results by distance which is a win for your customers.
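A hedged sketch of that two-pass idea; the Jobs table, Location geography column, and connectionString are all hypothetical stand-ins for your schema:

using System.Collections.Generic;
using System.Data.SqlClient;

// First pass: SQL Server 2008's geography type finds job IDs within the radius.
const string sql =
    "SELECT JobID FROM Jobs " +
    "WHERE Location.STDistance(geography::Point(@lat, @lon, 4326)) <= @meters";

var jobIds = new List<string>();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.Parameters.AddWithValue("@lat", 42.36);
    cmd.Parameters.AddWithValue("@lon", -71.06);
    cmd.Parameters.AddWithValue("@meters", 100 * 1609.344); // 100 miles in meters
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            jobIds.Add(reader.GetInt32(0).ToString());
}
// Second pass: OR the IDs into the Lucene query (or a filter) against the
// UN_TOKENIZED jobId field.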

Related

Lucene non indexed fields, case insensitive search?

Imagine that all documents have the following fields:
Field("Id", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("From", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("To", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("Source", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("Target", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
One of the requirements I have is to reuse the documents if From, To and Source are exactly the same (case insensitive).
However, these fields are not analyzed (e.g., by StandardAnalyzer, which would lowercase the terms before indexing).
1. Is it possible to do a case-insensitive search on non-analyzed fields?
2. What about field name values; can I also do a case-insensitive search for "From", "from", "FROM"?
Overview:
I want to perform case insensitive search.
Example: "From:something", "from:Something", "FROM:SOMething", "from:SOMETHING" -> retrieve same result set.
1 - No. You can always lowercase the fields yourself before indexing, or analyze them with an analyzer consisting of a KeywordTokenizer and LowerCaseFilter. How you index in Lucene is very much a GIGO operation. If the way you analyze and index the fields doesn't enable your search needs, you'll have a rough time.
2 - Again, no (not that I'm aware of, anyway). You'd need to handle this in your code. If you just always use lowercase field names, it ought to be easy enough to normalize, though.
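For completeness, the analyzer mentioned in (1) is tiny. A sketch in Lucene.Net syntax (the Java version is analogous):

using Lucene.Net.Analysis;

// The whole field value becomes a single lowercased token: exact matching
// like NOT_ANALYZED, but case-insensitive. Use it at index AND query time.
public class LowercaseKeywordAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        return new LowerCaseFilter(new KeywordTokenizer(reader));
    }
}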

Lucene Indexing to ignore apostrophes

I have a field that might have apostrophes in it.
I want to be able to:
1. store the value as is in the index
2. search based on the value ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
If I then do the same replace when searching, I guess it should work: use the cleaned value to index/search and the value as-is for display.
Am I missing any other considerations here?
Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. I also don't see the benefit of adding the field twice in your use case (see Document.add).
If you want to store the original value and yet be able to search without the apostrophes, simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED));
Then hook up a custom Tokenizer that replaces apostrophes (I think Lucene's StandardAnalyzer already includes this transformation).
If you are storing the field with the aim of using highlighting, you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.
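If you'd rather not write a full Tokenizer, one low-tech way to encapsulate the stripping in an Analyzer is to rewrite the input before delegating to an inner analyzer. A sketch in Lucene.Net syntax (the Java version is analogous); reading the whole value into memory is fine for short fields, and a CharFilter would be more idiomatic if your Lucene version has one:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Strips straight and curly apostrophes before tokenization, so indexing
// and searching share one recipe.
public class ApostropheStrippingAnalyzer : Analyzer
{
    private readonly Analyzer inner = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        string text = reader.ReadToEnd()
            .Replace("'", "").Replace("\u2018", "")
            .Replace("\u2019", "").Replace("`", "");
        return inner.TokenStream(fieldName, new StringReader(text));
    }
}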

Is it possible to add custom metadata to a Lucene field?

I've come to the point where I need to store some additional data about where a particular field comes from in my Lucene.Net index. Specifically, I want to attach a guid to certain fields of a document when the field is added to the document, and retrieve it again when I get the document from a search result.
Is this possible?
Edit:
Okay, let me clarify a bit by giving an example.
Let's say I have an object that I want to allow the user to tag with custom tags like "personal", "favorite", "some-project". I do this by adding multiple "tag" fields to the document, like so:
doc.Add( new Field( "tag", "personal" ) );
doc.Add( new Field( "tag", "favorite" ) );
The problem is I now need to record some meta data about each individual tag itself, specifically a guid representing where that tag came from (imagine it as a user id). Each tag could potentially have a different guid, so I can't simply create a "tag-guid" field (unless the order of the values is preserved---see edit 2 below). I don't need this metadata to be indexed (and in fact I'd prefer it not to be, to avoid getting hits on metadata), I just need to be able to retrieve it again from the document/field.
doc.GetFields( "tag" )[0].Metadata...
(I'm making up syntax here, but I hope my point is clear now.)
Edit 2:
Since this is a completely different question, I've posted a new question for this approach: Is the order of multi-valued fields in Lucene stable?
Okay let's try another approach... The key problem area is the indeterminacy of the multiple field values under the same field name (e.g. "tag"). If I could introduce or obtain some kind of determinacy here, I might be able to store the metadata in another field.
For example, if I could rely on the order of the values of the field never changing, I could use an index in the set of values to identify exactly which tag I am referring to.
Is there any guarantee that the order I add the values to a field will remain the same when I retrieve the document at a later time?
Depending on your search requirements for this index, this may be possible: store the tags and their references as two parallel delimited fields, so you control the order of the values. It would require updating both fields as the tag list changes, of course, but the overhead may be worth it.
doc.Add(new Field("tags", "{personal}|{favorite}"));
doc.Add(new Field("tagsref", "{1234}|{12345}"));
Note: using the {} allows you to qualify your search for uniqueness where similar values exist.
Example: If values were stored as "person|personal|personage" searching for "person" would return a document that has any one of person, personal or personage. By qualifying in curly brackets like so: "{person}|{personal}|{personage}", I can search for "{person}" and be sure it won't return false positives. Of course, this assumes you don't use curly brackets in your values.
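To recover the guid later, the position of the matched tag in one field indexes into the other. A sketch against the hypothetical field layout above:

// Parallel-field lookup: find the tag's position in "tags", then read the
// same position from "tagsref".
string[] tags = doc.Get("tags").Split('|');     // {personal}|{favorite}
string[] refs = doc.Get("tagsref").Split('|');  // {1234}|{12345}

int i = System.Array.IndexOf(tags, "{personal}");
string guid = (i >= 0 && i < refs.Length) ? refs[i].Trim('{', '}') : null;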
I think you're asking about payloads.
Edit: From your use case, it sounds like you have no desire to use this metadata in your search, you just want it there. (Basically, you want to use Lucene as a database system.)
So, why can't you use a binary field?
ExtraData ed = new ExtraData { Tag = "tag", Type = "personal" };
var ms = new System.IO.MemoryStream();
new BinaryFormatter().Serialize(ms, ed); // System.Runtime.Serialization.Formatters.Binary
doc.Add(new Field("myData", ms.ToArray(), Field.Store.YES));
Then you can deserialize it on retrieval.
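A matching retrieval sketch (GetBinaryValue is Lucene.Net's accessor for stored binary fields):

// Deserialize the stored bytes back into the original object.
byte[] bytes = hitDoc.GetBinaryValue("myData");
var input = new System.IO.MemoryStream(bytes);
ExtraData restored = (ExtraData)new BinaryFormatter().Deserialize(input);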

Indexing n-word expressions as a single term in Lucene

I want to index a "compound word" like "New York" as a single term in Lucene not like "new", "york". In such a way that if someone searches for "new place", documents containing "new york" won't match.
I don't think this is a case for n-grams (i.e., NGramTokenizer), because I don't want to index just any n-gram; I want to index only some specific n-grams.
I've done some research and I know I should write my own Analyzer and maybe my own Tokenizer. But I'm a bit lost extending TokenStream/TokenFilter/Tokenizer.
Thanks
I presume you have some way of detecting the multi-word units (MWUs) that you want to preserve. Then what you can do is replace the whitespace in them by an underscore and use a WhiteSpaceAnalyzer instead of a StandardAnalyzer (which throws away punctuation), perhaps with a LowerCaseFilter.
Writing your own Tokenizer requires quite some Lucene black magic. I've never been able to wrap my head around the Lucene 2.9+ APIs, but check out the TokenStream docs if you really want to try.
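A sketch of the underscore preprocessing (in C#/Lucene.Net syntax; the Mwus list is a hypothetical stand-in for however you detect your multi-word units):

// Join each known MWU with underscores before analysis, then index with a
// WhitespaceAnalyzer so "New_York" survives as a single token. Apply the
// same preprocessing to the user's query string.
static readonly string[] Mwus = { "New York", "Los Angeles" };

static string Preprocess(string text)
{
    foreach (string mwu in Mwus)
        text = text.Replace(mwu, mwu.Replace(' ', '_'));
    return text;
}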
I did it by creating a field which is indexed but not analyzed, using Field.Index.NOT_ANALYZED:
doc.add(new Field("fieldName", "value", Field.Store.YES, Field.Index.NOT_ANALYZED, TermVector.YES));
This bypasses the StandardAnalyzer for that field.
I was working on Lucene 3.0.2.

SpatialQuery for location based search using Lucene

My lucene index has got latitude and longitudes fields indexed as follows:
doc.Add(new Field("latitude", latitude.ToString() , Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("longitude", longitude.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
I want to retrieve a set of documents from this index whose lat and long values are in a given range.
As you already know, lat and long can be negative values. How do I correctly store signed decimal numbers in Lucene?
Would the approach mentioned below give correct results or is there any other way to do this?
Term lowerLatitude = new Term("latitude", bounds.South.ToString() );
Term upperLatitude = new Term("latitude", bounds.North.ToString());
RangeQuery latitudeRangeQuery = new RangeQuery(lowerLatitude, upperLatitude, true);
findLocationQuery.Add(latitudeRangeQuery, BooleanClause.Occur.SHOULD);
Term lowerLongitude = new Term("longitude", bounds.West.ToString());
Term upperLongitude = new Term("longitude", bounds.East.ToString());
RangeQuery longitudeRangeQuery = new RangeQuery(lowerLongitude, upperLongitude, true);
findLocationQuery.Add(longitudeRangeQuery, BooleanClause.Occur.SHOULD);
Also,I wanted to know how Lucene's ConstantScoreRangeQuery is better than RangeQuery class.
I'm facing another problem in this context:
One of the documents in my index has the following 3 cities:
Lyons, IL
Oak Brook, IL
San Francisco, CA
If I give "Lyons, IL" as input, then this record comes up.
But if I give "San Francisco, CA" as input, it does not.
However, if I store the cities for this document as follows:
San Francisco, CA
Lyons, IL
Oak Brook, IL
and I give "San Francisco, CA" as input, then this record shows up in the search results.
What I want is that if I type any of the 3 cities as input, I should get this document in the search results.
Please help me achieve this.
Thanks.
Following up on skaffman's suggestion, you can use the same tile coordinate system used by all the popular map apps. Choose whatever zoom level is granular enough for your needs, and don't forget to pad with leading zeros.
Regarding RangeQuery, it's slower than ConstantScoreRangeQuery, and it limits how many distinct values the range can span, since it expands into a BooleanQuery of terms.
Regarding the city-state problem, we can only speculate. But the first things to check are that the indexed terms and the parsed query are what you expect them to be.
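A sketch of the tile math (these are the standard slippy-map formulas; the zoom level and padding width are up to you):

// lon/lat -> tile x/y at a given zoom, as used by the popular map apps.
static int TileX(double lon, int zoom)
{
    return (int)((lon + 180.0) / 360.0 * (1 << zoom));
}

static int TileY(double lat, int zoom)
{
    double r = lat * Math.PI / 180.0;
    return (int)((1 - Math.Log(Math.Tan(r) + 1 / Math.Cos(r)) / Math.PI) / 2 * (1 << zoom));
}
// Index e.g. zoom + "/" + TileX(lon, zoom).ToString("D7") + "/" +
// TileY(lat, zoom).ToString("D7") as a single UN_TOKENIZED term; the
// zero-padding keeps term order consistent.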
I think the best way is to convert/normalize the coordinates as suggested in the previous post. This article does exactly that; it's actually quite nice object-oriented code.
Regarding your second problem. I would assume you have some sort of Analyzer problem. Are you using the same Analyzer for indexing and querying? Which tokenizers do you use?
I recommend using Luke to inspect your generated index and see which tokens are actually searchable.
--Hardy
One option here is to convert the coordinates into a system that doesn't have negative numbers. For example, I've had a similar problem for a google maps webapp for the UK, and I stored UK Easting/Northings (which range from 0 to 7 digits) fields in Lucene alongside the lat/long values. By formatting these eastings/northings with left-padded zeroes, I could do lucene range queries.
Is there a similar coordinate system for the US?
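A sketch of the padding plus range query (easting and radius here are hypothetical integers in the same units as the indexed values):

// Left-padding makes lexicographic term order agree with numeric order,
// which is what RangeQuery relies on. Eastings/northings are non-negative,
// so there is no sign problem.
string lowerVal = (easting - radius).ToString("D7");
string upperVal = (easting + radius).ToString("D7");

RangeQuery eastingRange = new RangeQuery(
    new Term("easting", lowerVal),
    new Term("easting", upperVal),
    true); // inclusive bounds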