Lucene non-analyzed fields, case-insensitive search? - lucene

Imagine that all documents have the following fields:
Field("Id", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("From", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("To", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("Source", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
Field("Target", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
One of the requirements I have is to reuse a document if From, To and Source are exactly the same (case-insensitive).
However, those fields are not analyzed (e.g. with StandardAnalyzer, which lowercases the terms before indexing).
1. Is it possible to do a case-insensitive search on non-analyzed fields?
2. What about field name values: can I also do a case-insensitive search for "From", "from", "FROM"?
Overview:
I want to perform a case-insensitive search.
Example: "From:something", "from:Something", "FROM:SOMething", "from:SOMETHING" -> all retrieve the same result set.

1 - No. You can always lowercase the field values yourself before indexing, or analyze them with an analyzer consisting of a KeywordTokenizer and a LowerCaseFilter. How you index in Lucene is very much a GIGO operation: if the way you analyze and index the fields doesn't support your search needs, you'll have a rough time.
2 - Again, no (not that I'm aware of, anyway). You'd need to handle this in your own code. If you just always use lowercase field names, it ought to be easy enough to normalize them, though.
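For reference, a minimal sketch of the second option, assuming Lucene 3.6 (the class name here is mine, not part of Lucene): a keyword-style analyzer that keeps each field value as a single token but lowercases it. You would index From, To and Source as Field.Index.ANALYZED with this analyzer and pass the same analyzer to the QueryParser.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;

public final class LowercaseKeywordAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Emit the whole field value as one token, then lowercase it
        return new LowerCaseFilter(Version.LUCENE_36, new KeywordTokenizer(reader));
    }
}
The simpler alternative is to call toLowerCase() on the value yourself before adding the NOT_ANALYZED field, and lowercase the query term the same way at search time.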

Related

Indexing an integer in Lucene

I am using the following code to index an integer value:
String key = hmap.get("key");
if (key != null && key.trim().length() > 0) {
    System.out.println("key == " + Integer.parseInt(key));
    doc.add(new IntField("kv", Integer.parseInt(key), IndexFieldTypes.getFieldType(INDEX_STORE_FIELD)));
}
The problem is: if 'key' is '50', the line 'key == 50' gets printed fine, but when it reaches the 'doc.add' line it throws the following exception:
java.lang.IllegalArgumentException: type.numericType() must be INT but got null
at org.apache.lucene.document.IntField.<init>(IntField.java:171)
Can someone figure out what is going wrong?
An IntField must have a FieldType whose numericType() is FieldType.NumericType.INT. Of course, I don't have intimate knowledge of your IndexFieldTypes class, but I would guess its default INDEX_STORE_FIELD has no numeric type set (rightly so: if the numeric type is non-null, Lucene will try to index the field as a number).
You may not necessarily need to pass a field type to IntField, though; you could just do something like:
doc.add(new IntField("kv", Integer.parseInt(key), Field.Store.YES));
If you do need to define a FieldType, either use a different type from the existing ones in IndexFieldTypes, or implement logic there to create one suitable for an IntField. Or just set the numeric type after the FieldType is retrieved, like:
FieldType type = IndexFieldTypes.getFieldType(INDEX_STORE_FIELD);
type.setNumericType(FieldType.NumericType.INT);
doc.add(new IntField("kv", Integer.parseInt(key), type));
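If IndexFieldTypes is awkward to change, another option (a sketch assuming Lucene 4.x, where IntField lives) is to build a fresh FieldType; the predefined IntField.TYPE_STORED works too:
FieldType intType = new FieldType();
intType.setIndexed(true);                            // make the value searchable
intType.setStored(true);                             // keep it retrievable from the index
intType.setNumericType(FieldType.NumericType.INT);   // required by IntField
intType.freeze();                                    // prevent accidental later changes
doc.add(new IntField("kv", Integer.parseInt(key), intType));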

Lucene Indexing to ignore apostrophes

I have a field that might have apostrophes in it.
I want to be able to:
1. store the value as is in the index
2. search based on the value ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
If I then do the same replace when searching, I guess it should work: use the cleaned value for indexing/searching and the value as-is for display.
Am I missing any other considerations here?
Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. I also don't see the benefit of adding the same field twice in your use case (see Document.add).
If you want to Store the original value and yet be able to search without the apostrophes simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED);
Then simply hook up a custom Tokenizer or TokenFilter that strips the apostrophes (I think Lucene's StandardAnalyzer already includes a similar transformation).
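A rough sketch of such an analysis chain, assuming Lucene 3.6 (both class names are mine, not Lucene's):
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Removes straight and curly apostrophes from every token.
final class ApostropheStrippingFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    ApostropheStrippingFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String cleaned = termAtt.toString().replaceAll("['‘’`]", "");
        termAtt.setEmpty().append(cleaned);
        return true;
    }
}

// Standard tokenization plus lowercasing, with apostrophes stripped at analysis time.
final class ApostropheInsensitiveAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_36, reader);
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        return new ApostropheStrippingFilter(stream);
    }
}
Use the same analyzer at index time and query time so both sides agree on the cleaned-up form.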
If you are storing the field with the aim of using highlighting you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.

Lucene - Effective text search

I have an index generated by the PDFBox API class LucenePDFDocument. As the index contains only the text contents, I wish to search this index effectively.
I search the 'contents' field with the search string, and the results must be ordered from most relevant to least relevant. The code given below did display the files containing the words of the searched text, e.g. 'What is your nationality', but the results didn't contain a file with this full sentence.
What query parser and query should I use to search in the above scenario?
Query query = new MultiFieldQueryParser(Version.LUCENE_30, fields,
        new StandardAnalyzer(Version.LUCENE_30)).parse(searchString);
TopScoreDocCollector collector = TopScoreDocCollector.create(5, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("count " + hits.length);
for (ScoreDoc scoreDoc : hits) {
    int docId = scoreDoc.doc;
    Document d = searcher.doc(docId);
    System.out.println(d.getField("path"));
}
It's not about the programmatic part, but about Lucene query syntax. To search for a whole phrase, just wrap it in double quotes, i.e. instead of searching
What is your nationality
search
"What is your nationality"
Without quotes, Lucene finds all documents containing each separate word, i.e. "what", "is", "your" and "nationality" ("is" and "your" may be omitted as stop words), and sorts them by the overall number of occurrences in the doc, not only within that phrase. Since you set the number of docs to find to only 5 in TopScoreDocCollector, the file with the full phrase may not show up in the results. Adding quotes makes Lucene ignore all docs that lack the exact phrase.
Also, if you search only in the 'contents' field, you don't need MultiFieldQueryParser and can use a plain QueryParser instead.
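For example, a sketch of the single-field, phrase-query version, assuming your existing searcher and Lucene 3.0:
QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
        new StandardAnalyzer(Version.LUCENE_30));
// The surrounding double quotes turn the input into a phrase query
Query query = parser.parse("\"What is your nationality\"");
TopDocs topDocs = searcher.search(query, 10);   // ask for more than 5 hits if needed
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(searcher.doc(scoreDoc.doc).getField("path"));
}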

Indexing n-word expressions as a single term in Lucene

I want to index a "compound word" like "New York" as a single term in Lucene, not as "new" and "york", in such a way that if someone searches for "new place", documents containing "new york" won't match.
I think this is not a job for n-grams (i.e. NGramTokenizer), because I don't want to index just any n-gram, only some specific ones.
I've done some research and I know I should write my own Analyzer and maybe my own Tokenizer. But I'm a bit lost extending TokenStream/TokenFilter/Tokenizer.
Thanks
I presume you have some way of detecting the multi-word units (MWUs) that you want to preserve. Then what you can do is replace the whitespace in them with an underscore and use a WhitespaceAnalyzer instead of a StandardAnalyzer (which throws away punctuation), perhaps combined with a LowerCaseFilter.
Writing your own Tokenizer requires quite some Lucene black magic. I've never been able to wrap my head around the Lucene 2.9+ APIs, but check out the TokenStream docs if you really want to try.
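As a rough illustration of the underscore approach (assuming Lucene 3.6; the MWU detection itself is assumed to happen elsewhere):
// Join the detected multi-word unit with an underscore so it survives tokenization
String text = "She moved to New York last year";
String prepared = text.replace("New York", "New_York");

// WhitespaceAnalyzer splits only on whitespace, so "New_York" stays a single term;
// pass it to both the IndexWriter and the QueryParser (add a LowerCaseFilter if needed)
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
doc.add(new Field("contents", prepared, Field.Store.YES, Field.Index.ANALYZED));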
I did it by creating a field that is indexed but not analyzed. For this I used Field.Index.NOT_ANALYZED:
doc.add(new Field("fieldName", "value", Field.Store.YES, Field.Index.NOT_ANALYZED, TermVector.YES));
The index writer itself still used the StandardAnalyzer.
I worked on Lucene 3.0.2.

Need Lucene query optimization advice

I am working on a web-based job search application using Lucene. Users on my site can search for jobs within a radius of 100 miles from, say, "Boston, MA" or any other location.
Also, I need to show the search results sorted by "relevance" (i.e. the score returned by Lucene) in descending order.
I'm using a 3rd-party API to fetch all the cities within a given radius of a city. This API returns around 864 cities within a 100-mile radius of "Boston, MA".
I'm building the city/state Lucene query using the following logic, which is part of my BuildNearestCitiesQuery method.
Here nearestCities is a hashtable returned by the above API. It contains the 864 cities, with CityName as key and StateCode as value.
And finalQuery is a Lucene BooleanQuery object which contains the other search criteria entered by the user, such as skills, keywords, etc.
foreach (string city in nearestCities.Keys)
{
    BooleanQuery tempFinalQuery = finalQuery;
    cityStateQuery = new BooleanQuery();
    queryCity = queryParserCity.Parse(city);
    queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);
    cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST);    // MUST is like an AND
    cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD);   // SHOULD is like an OR
}
finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);
I then pass the finalQuery object to Lucene's Search method to get all the jobs within a 100-mile radius:
searcher.Search(finalQuery, collector);
I found out that this BuildNearestCitiesQuery method takes a whopping 29 seconds on average to execute, which is obviously unacceptable by any standard for a website. I also found out that the statements involving Parse take a considerable amount of time to execute compared to the other statements.
A job for a given location is a dynamic attribute, in the sense that a city could have 2 jobs (meeting a particular search criterion) today, but zero jobs for the same criteria 3 days later. So I cannot use any caching here.
Is there any way I can optimize this logic, or for that matter my whole approach/algorithm for finding all jobs within 100 miles using Lucene?
FYI, here is what my indexing in Lucene looks like:
doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO, Field.Index.TOKENIZED));
doc.Add(new Field("city", job.City.Trim(), Field.Store.YES, Field.Index.TOKENIZED , Field.TermVector.YES));
doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("citystate", job.City.Trim() + ", " + job.StateCode.Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED , Field.TermVector.YES));
doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED,Field.TermVector.YES));
doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("showAllJobs", "yy", Field.Store.NO, Field.Index.UN_TOKENIZED));
Thanks a ton for reading! I would really appreciate your help on this.
Janis
Not quite sure if I completely understand your code, but when it comes to geospatial search a filter approach might be more appropriate. Maybe this link can give you some ideas - http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html
Maybe you can use Filters for other parts of your query as well. To be honest your query looks quite complex.
--Hardy
Apart from tempFinalQuery being unused and an unnecessary map lookup to get the state, there doesn't seem to be anything too egregious in the code you post. Apart from the formatting...
If all the time is taken in the Parse methods, posting their code here would make sense.
I might have missed the point of your question but do you have the possibility of storing latitude and longitude for zip codes? If that is an option, you could then compute the distance between two coordinates providing a much more straightforward scoring metric.
I believe the best approach is to move the nearest-city determination into a search filter. I would also reconsider how you have the fields set up; consider creating one term that combines city and state, which would simplify the query.
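A rough sketch of that idea in Java Lucene syntax (the Lucene.NET calls are analogous), assuming nearestCities maps city name to state code as described and reusing the existing "citystate" field, which was indexed UN_TOKENIZED as "City, ST":
// Build the location clause from exact terms instead of calling Parse per city;
// the Parse calls were the slow part, and TermQuery needs no analysis at all.
BooleanQuery nearestCityQuery = new BooleanQuery();
for (Map.Entry<String, String> entry : nearestCities.entrySet()) {
    String cityState = entry.getKey().trim() + ", " + entry.getValue().trim();
    nearestCityQuery.add(new TermQuery(new Term("citystate", cityState)),
            BooleanClause.Occur.SHOULD);
}
finalQuery.add(nearestCityQuery, BooleanClause.Occur.MUST);
Note that the term values must match the indexed "City, ST" strings exactly (including case), and 864 SHOULD clauses still fit under BooleanQuery's default limit of 1024 clauses.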
I'd suggest:
1. storing the latitude and longitude of locations as they come in
2. when a user enters a city and distance, turn that into a lat/lon value and degrees
3. do a single, simple lookup based on numerical distance lat/lon comparisons
You can see an example of how this works in the Geo::Distance Perl module. Take a look at the closest method in the source, which implements this lookup via simple SQL.
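Translated into Lucene terms rather than SQL, the same idea could look roughly like this, assuming each job document gains numeric "lat" and "lon" fields (indexed via NumericField) and that minLat/maxLat/minLon/maxLon have already been computed from the search point and radius:
// Bounding-box query over numeric latitude/longitude fields
BooleanQuery boundingBox = new BooleanQuery();
boundingBox.add(NumericRangeQuery.newDoubleRange("lat", minLat, maxLat, true, true),
        BooleanClause.Occur.MUST);
boundingBox.add(NumericRangeQuery.newDoubleRange("lon", minLon, maxLon, true, true),
        BooleanClause.Occur.MUST);
finalQuery.add(boundingBox, BooleanClause.Occur.MUST);
The box over-selects slightly near its corners, so an exact distance check can be applied to the hits afterwards if precision matters.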
I agree with the others here that this approach smells. Also, doing a textual search on city names is not always that reliable. There is often a bit of subjectivity between place names (particularly areas within a city, which might themselves be large).
Doing a geospatial query is the way to go. Not knowing the rest of your setup, it is hard to advise. You do have spatial support built into Fluent NHibernate and SQL Server 2008, for example. You could then do a search very quickly and efficiently. However, your challenge is to get this working within Lucene.
You could possibly do a "first pass" query using spatial support in SQL Server, and then run those results through Lucene?
The other major benefit of doing spatial queries is that you can then easily sort your results by distance which is a win for your customers.