Search with term vector position in Lucene - lucene

Is it possible to search for document similarity based on term vector position in Lucene?
For example, there are three documents with content as follows:
1: Hi how are you
2: Hi how you are
3: Hi how are you
Now if doc 1 is searched in Lucene, it should return doc 3 with a higher score and doc 2 with a lower score, because doc 2 has the words "you" and "are" at different positions.
In short, Lucene should return exact-matching documents based on term positions.

I think what you need is a PhraseQuery. It is a Lucene Query type that takes the precise positions of your tokens into account and lets you define a slop, i.e. a tolerance for permutations of those tokens.
In other words, the more your tokens differ from the source in terms of positions, the less they will be scored.
You can use it like this:
QueryBuilder analyzedBuilder = new QueryBuilder(new MyAnalyzer());
PhraseQuery query = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery);
createPhraseQuery also accepts a third parameter, the slop I alluded to, if you want to tweak it.
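To make the intuition behind position-sensitive scoring concrete, here is a toy sketch in plain Java (not Lucene internals; the class and method names are mine). It scores a document by how many query tokens sit at the same position as in the query, which reproduces the doc 1 / doc 2 / doc 3 ranking from the question:

```java
// Toy illustration of position-sensitive matching (not how Lucene scores
// internally): count query tokens that appear at the same position in the doc.
public class PositionMatch {
    public static int score(String query, String doc) {
        String[] q = query.split("\\s+");
        String[] d = doc.split("\\s+");
        int score = 0;
        for (int i = 0; i < Math.min(q.length, d.length); i++) {
            if (q[i].equals(d[i])) {
                score++; // token matches at the same position
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // "Hi how are you" vs the two candidate documents from the question
        System.out.println(score("Hi how are you", "Hi how are you")); // 4
        System.out.println(score("Hi how are you", "Hi how you are")); // 2
    }
}
```

The exact match scores 4 while the permuted document scores only 2, which is the ordering the question asks for; a PhraseQuery with a slop achieves the same effect with real analysis and scoring.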

Related

How to make the first n words more important in Lucene

I want to make the first n words (where I set n) of a document more important than the rest of the document in Lucene. How would I do that? I found something about boosting, but that boosts a whole field. My document is supposed to be a single field.
Is numbering the words at indexing time and boosting them a solution? Something like this:
TextField myField = new TextField("text", termAtt.toString(), Store.YES);
myField.setBoost(2);
document.add(myField);
doing this as long as I haven't reached the n-th word in my document?
I want the following result: say the first 20 words of a document are more important than the rest. I have two documents with more than 20 words that are otherwise identical; I add the word I am searching for as the first word of one document and as the last word of the other, and I want the first document to get a higher score.
The best approach would be simply to create two different fields: one containing the higher-value leading portion of the text (this wouldn't need to be stored), and the other containing the full text:
int leadinLength = 20;
TextField myFieldLeadin = new TextField("text_leadin", termAtt.toString().substring(0, leadinLength), Store.NO);
TextField myField = new TextField("text", termAtt.toString(), Store.YES);
myFieldLeadin.setBoost(2);
document.add(myFieldLeadin);
document.add(myField);
You could use a MultiFieldQueryParser to streamline searching both fields at once, if desired, like:
Query query = new MultiFieldQueryParser(Version.LUCENE_48, new String[] {"text_leadin", "text"}, analyzer).parse("my search query");
TopDocs docs = searcher.search(query, 10);

Elasticsearch: match every position only once

In my Elasticsearch index I have documents that have multiple tokens at the same position.
I want to get a document back when I match at least one token at every position.
The order of the tokens is not important.
How can I accomplish that? I use Elasticsearch 0.90.5.
Example:
I index a document like this.
{
  "field": "red car"
}
I use a synonym token filter that adds synonyms at the same positions as the original token.
So now in the field, there are 2 positions:
Position 1: "red"
Position 2: "car", "automobile"
My solution for now:
To be able to ensure that all positions match, I index the maximum position as well.
{
  "field": "red car",
  "max_position": 2
}
I have a custom similarity that extends DefaultSimilarity and returns 1 from tf(), idf(), and lengthNorm(). The resulting score is then the number of matching terms in the field.
Query:
{
  "custom_score": {
    "query": {
      "match": {
        "field": "a car is an automobile"
      }
    },
    "_script": "_score*100/doc[\"max_position\"]+_score"
  },
  "min_score": "100"
}
Problem with my solution:
The above search should not match the document, because there is no token "red" in the query string. But it matches anyway: Elasticsearch counts the matches for "car" and "automobile" as two matches, which gives a score of 2 and thus a script score of 102, satisfying the "min_score".
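Plugging the numbers from this failure case into the script makes the false positive explicit. A plain-Java mirror of the script's arithmetic (class and method names are illustrative):

```java
// Mirrors the scoring script: _score * 100 / max_position + _score
public class ScriptScore {
    public static double score(double rawScore, int maxPosition) {
        return rawScore * 100 / maxPosition + rawScore;
    }

    public static void main(String[] args) {
        // "car" and "automobile" both match -> raw score 2, max_position 2,
        // even though position 1 ("red") was never matched.
        System.out.println(score(2, 2)); // 102.0 -> clears min_score 100
    }
}
```

Because the two synonym tokens at position 2 each count as a match, the raw score reaches max_position without every position actually being covered.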
If you needed to guarantee 100% matches against the query terms you could use minimum_should_match. This is the more common case.
Unfortunately, in your case you wish to require 100% matches of the indexed terms. To do this, you'll have to drop down to the Lucene level and write a custom Similarity class in Java (here's boilerplate you can fork), because you need access to low-level index information that is not exposed to the Query DSL:
Per document/field scanned in the query scorer:
Number of analyzed terms matched (overlap is the Lucene terminology; it is used in the coord() method of the DefaultSimilarity class)
Number of total analyzed terms in the field: Look at this thread for a couple different ways to get this information: How to count the number of terms for each document in lucene index?
Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where matched terms < total terms and multiply their score by zero.
Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.
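The core of such a similarity, reduced to plain Java (the class and method names here are illustrative; in real Lucene this logic would live in the Similarity's coord(overlap, maxOverlap) hook, and you would read the total term count from the index as described above):

```java
// Sketch of the "all terms or nothing" idea behind the custom Similarity:
// keep the score only when every analyzed term in the field was matched.
public class AllOrNothingCoord {
    // Analogous to coord(overlap, maxOverlap): 0 unless everything matched.
    public static float coord(int overlap, int maxOverlap) {
        return overlap < maxOverlap ? 0f : 1f;
    }

    public static float score(float rawScore, int matchedTerms, int totalTerms) {
        return rawScore * coord(matchedTerms, totalTerms);
    }

    public static void main(String[] args) {
        // Both field positions matched: the score survives.
        System.out.println(score(2f, 2, 2)); // 2.0
        // "car"/"automobile" matched but "red" missed: score zeroed out.
        System.out.println(score(2f, 1, 2)); // 0.0
    }
}
```

Because the multiplier is zero whenever any indexed term goes unmatched, the synonym double-counting from the question can no longer push a partial match over the threshold.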

Apache lucene and text meaning

I have a question about the searching process in Lucene.
I use this code for searching:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I've set "I want to go shopping" for a field, and in another document I've set "I wanna go shopping".
The meaning of both sentences is the same!
Is there any good solution for Lucene to understand the meaning of sentences, or some way to normalize them? For example, saving the field as "I wanna /want to/ go shopping" and removing the comment with a regexp in the results.
Lucene provides filters to normalize words and even map similar words.
PorterStemFilter -
Stemming reduces words to their roots.
E.g. "wanted" and "wants" are reduced to the root "want", and a search for any of those words will match the document.
However, "wanna" does not reduce to the root "want", so stemming may not work in this case.
SynonymFilter -
lets you map similar words in a configuration file,
so "wanna" can be mapped to "want", and if you search for either of those the document will match.
You would need to add the filters to your analysis chain.
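To illustrate what a synonym mapping buys you at analysis time, here is a toy sketch in plain Java (not the SynonymFilter API; the class, method, and map contents are mine). Both the indexed text and the query go through the same mapping, so the two sentences end up identical:

```java
import java.util.Map;

// Toy sketch of synonym normalization: expand mapped tokens the same way
// at index time and at query time, so variants collapse to one form.
public class SynonymNormalize {
    // Illustrative one-entry synonym map; a real setup reads a config file.
    static final Map<String, String> SYNONYMS = Map.of("wanna", "want to");

    public static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(SYNONYMS.getOrDefault(token, token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("I wanna go shopping"));   // i want to go shopping
        System.out.println(normalize("I want to go shopping")); // i want to go shopping
    }
}
```

In real Lucene the expansion happens inside the analysis chain as token positions rather than string rewriting, but the effect on matching is the same: either sentence now finds the other.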

hit highlighting in lucene

I am searching for strings indexed in Lucene as documents. Now I give it a long string to match.
Example:
search string: "iamrohitbanga is a stackoverflow user"
documents:
document 1: field value: rohit
document 2: field value: banga
Now I use fuzzy matching to find the search terms in the documents.
Both documents match. I want to retrieve the position at which the string "rohit" occurs in the search string. How do I do that using the Lucene Java API?
Also note that fuzzy matching can lead to inexact matches, but I am interested in the position of the word in the searched string.
The answer to
Finding the position of search hits from Lucene
refers to a website which requires us to download some files from http://www.iq-computing.de, and this page does not load.
So could you provide a solution?
Probably this should help:
http://lucene.apache.org/java/2_9_1/api/contrib-highlighter/index.html
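As a rough illustration of what you'd compute once the Highlighter (or term vectors) tells you which term matched, here is a toy sketch in plain Java that locates the word in the search string containing a matched term. It is not the Highlighter API; "match" is approximated here by substring containment, since the real query used fuzzy matching:

```java
// Toy sketch: report the position (word index) in the search string at
// which a document's matched field value occurs.
public class MatchPosition {
    // Returns the index of the first word containing the term, or -1.
    public static int wordIndexOf(String searchString, String term) {
        String[] words = searchString.toLowerCase().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].contains(term.toLowerCase())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String q = "iamrohitbanga is a stackoverflow user";
        System.out.println(wordIndexOf(q, "rohit")); // 0
        System.out.println(wordIndexOf(q, "banga")); // 0
        System.out.println(wordIndexOf(q, "user"));  // 4
    }
}
```

With the real Highlighter, the fragmenter and scorer give you character offsets of the best fragments instead, which you can map back to word positions the same way.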

SpatialQuery for location based search using Lucene

My Lucene index has latitude and longitude fields, indexed as follows:
doc.Add(new Field("latitude", latitude.ToString() , Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("longitude", longitude.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
I want to retrieve the set of documents from this index whose lat and long values are in a given range.
As you already know, lat and long can be negative values. How do I correctly store signed decimal numbers in Lucene?
Would the approach below give correct results, or is there another way to do this?
Term lowerLatitude = new Term("latitude", bounds.South.ToString() );
Term upperLatitude = new Term("latitude", bounds.North.ToString());
RangeQuery latitudeRangeQuery = new RangeQuery(lowerLatitude, upperLatitude, true);
findLocationQuery.Add(latitudeRangeQuery, BooleanClause.Occur.SHOULD);
Term lowerLongitude = new Term("longitude", bounds.West.ToString());
Term upperLongitude = new Term("longitude", bounds.East.ToString());
RangeQuery longitudeRangeQuery = new RangeQuery(lowerLongitude, upperLongitude, true);
findLocationQuery.Add(longitudeRangeQuery, BooleanClause.Occur.SHOULD);
Also, I wanted to know how Lucene's ConstantScoreRangeQuery is better than the RangeQuery class.
I'm facing another problem in this context:
One of the documents in the index has the following 3 cities:
Lyons, IL
Oak Brook, IL
San Francisco, CA
If I give "Lyons, IL" as input then this record comes up,
but if I give "San Francisco, CA" as input, it does not.
However, if I store the cities for this document as follows:
San Francisco, CA
Lyons, IL
Oak Brook, IL
and give "San Francisco, CA" as input, the record shows up in the search results.
What I want here is that if I type any of the 3 cities as input, I should get this document in the search results.
Please help me achieve this.
Thanks.
Following up on skaffman's suggestion, you can use the same tile coordinate system used by all the popular map apps. Choose whatever zoom level is granular enough for your needs, and don't forget to pad with leading zeros.
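A sketch of that tile scheme, assuming the standard "slippy map" convention used by the popular web map apps (the class and method names are mine): lat/long map to non-negative (x, y) tile indices at a chosen zoom level, which can then be zero-padded and range-queried as strings.

```java
// Standard web-map ("slippy map") tile math: lat/lon -> tile indices.
// Indices are always >= 0, so they index cleanly as padded strings.
public class TileCoords {
    public static int tileX(double lon, int zoom) {
        return (int) Math.floor((lon + 180.0) / 360.0 * (1 << zoom));
    }

    public static int tileY(double lat, int zoom) {
        double latRad = Math.toRadians(lat);
        double n = (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0;
        return (int) Math.floor(n * (1 << zoom));
    }

    public static void main(String[] args) {
        // San Francisco (37.77, -122.42) at zoom level 10
        System.out.println(tileX(-122.42, 10)); // 163
        System.out.println(tileY(37.77, 10));   // 395
    }
}
```

Higher zoom levels give finer tiles; pick the level whose tile size matches the precision your bounding-box queries need.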
Regarding RangeQuery: it's slower than ConstantScoreRangeQuery, and because it expands into a BooleanQuery over every term in the range, it limits how many distinct values the range can cover.
Regarding the city-state problem, we can only speculate. But the first things to check are that the indexed terms and the parsed query are what you expect them to be.
I think the best way is to convert/normalize the coordinates as suggested in the previous post. This article does exactly that; it's actually quite nice object-oriented code.
Regarding your second problem, I would assume you have some sort of Analyzer problem. Are you using the same Analyzer for indexing and querying? Which tokenizers do you use?
I recommend using Luke to inspect your generated index and see which tokens are actually searchable.
--Hardy
One option here is to convert the coordinates into a system that doesn't have negative numbers. For example, I had a similar problem in a Google Maps webapp for the UK, and I stored UK Easting/Northing values (which range from 0 to 7 digits) in Lucene alongside the lat/long values. By formatting these eastings/northings with left-padded zeroes, I could do Lucene range queries.
Is there a similar coordinate system for the US?
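The padding trick itself doesn't depend on a national grid: shifting latitude by +90 and longitude by +180 makes the values non-negative anywhere in the world, after which fixed-precision scaling and zero-padding make lexicographic order agree with numeric order. A sketch (class, method names, and the precision factor are mine):

```java
// "Shift then zero-pad" encoding for signed coordinates: offset into a
// non-negative range, scale to fixed precision, left-pad with zeros so
// string comparison (what a string RangeQuery does) matches numeric order.
public class PaddedCoord {
    public static String encodeLat(double lat) {
        // lat in [-90, 90] -> [0, 180] -> micro-degrees, 9 digits
        return String.format("%09d", Math.round((lat + 90.0) * 1_000_000));
    }

    public static String encodeLon(double lon) {
        // lon in [-180, 180] -> [0, 360] -> micro-degrees, 9 digits
        return String.format("%09d", Math.round((lon + 180.0) * 1_000_000));
    }

    public static void main(String[] args) {
        System.out.println(encodeLat(-33.86)); // 056140000
        System.out.println(encodeLat(37.77));  // 127770000
        // Lexicographic order now agrees with numeric order:
        System.out.println(encodeLat(-33.86).compareTo(encodeLat(37.77)) < 0); // true
    }
}
```

Index the encoded strings in untokenized fields and run the range query over the encoded bounds, exactly as with the UK Easting/Northing approach above.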