Hit highlighting in Lucene

I am searching for strings that are indexed in Lucene as documents, and I give it a long string to match against them.
Example:
Search string: "iamrohitbanga is a stackoverflow user"
Documents:
document 1: field value: rohit
document 2: field value: banga
I use fuzzy matching to find the documents whose values occur in the search string.
Both documents match. I want to retrieve the position at which the string rohit occurs in the search string. How can I do this with the Lucene Java API?
Note that fuzzy matching also produces inexact matches, but I am still interested in the position of the matched word in the search string.
The answer to
Finding the position of search hits from Lucene
refers to a website that requires downloading some files from http://www.iq-computing.de, and that page does not load.
Could you provide a solution?

The contrib Highlighter package should probably help:
http://lucene.apache.org/java/2_9_1/api/contrib-highlighter/index.html
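If only the offset inside the search string is needed, the idea can also be sketched without Lucene. The following plain-Java sketch is not the Highlighter API and not how Lucene's fuzzy matching is implemented. It assumes plain Levenshtein distance and a hypothetical helper class FuzzyLocator, and it slides a window over the search string to find the span closest to an indexed value.

```java
final class FuzzyLocator {
    /** Classic Levenshtein edit distance between two strings. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Finds the window of text closest to term.
     * Returns {start, endExclusive, distance}, or null if no window
     * is within maxEdits edits. Earlier windows win ties.
     */
    static int[] locate(String text, String term, int maxEdits) {
        int[] best = null;
        int minLen = Math.max(1, term.length() - maxEdits);
        int maxLen = term.length() + maxEdits;
        for (int start = 0; start < text.length(); start++) {
            for (int len = minLen; len <= maxLen && start + len <= text.length(); len++) {
                int d = editDistance(text.substring(start, start + len), term);
                if (d <= maxEdits && (best == null || d < best[2])) {
                    best = new int[] {start, start + len, d};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String search = "iamrohitbanga is a stackoverflow user";
        int[] hit = locate(search, "rohit", 1);
        System.out.println(hit[0] + ".." + hit[1] + " distance " + hit[2]);
    }
}
```

For the example above, locate(search, "rohit", 1) reports the span from offset 3 to 8 with distance 0. The brute-force scan is fine for short search strings, but it would be too slow for long texts.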

Related

Search with term vector position in Lucene

Is it possible to search for document similarity based on term vector positions in Lucene?
For example there are three documents with content as follows
1: Hi how are you
2: Hi how you are
3: Hi how are you
Now if doc 1 is used as the search query, Lucene should return doc 3 with a higher score and doc 2 with a lower score, because doc 2 has the words "you" and "are" at different positions.
In short, Lucene should rank documents by how exactly their term positions match.
I think what you need is a PhraseQuery. It is a Lucene Query type that takes the precise positions of your tokens into account and lets you define a slop, that is, a tolerance for how far tokens may be moved or permuted.
With a non-zero slop, the more the token positions differ from those in the query, the lower the document scores; with the default slop of 0, only exact phrase matches are returned.
You can use it like this:
QueryBuilder analyzedBuilder = new QueryBuilder(new MyAnalyzer());
// createPhraseQuery is declared to return Query, so the variable is typed as Query
Query query = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery);
createPhraseQuery also accepts a third parameter, the slop mentioned above, if you want to tweak it.
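To get a feel for why doc 2 is "further away" than doc 3, here is a crude plain-Java proxy. It is not Lucene's sloppy-phrase algorithm. It assumes whitespace tokens that occur once per document and a hypothetical helper class PositionShift, and it sums how far each query token has moved.

```java
import java.util.Arrays;
import java.util.List;

final class PositionShift {
    /**
     * Sum over query tokens of |position in query - position in doc|.
     * Returns -1 if a query token does not occur in the document.
     * Assumes each token occurs at most once per document.
     */
    static int totalShift(List<String> query, List<String> doc) {
        int total = 0;
        for (int i = 0; i < query.size(); i++) {
            int j = doc.indexOf(query.get(i));
            if (j < 0) return -1;
            total += Math.abs(i - j);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("hi", "how", "are", "you");
        List<String> doc2 = Arrays.asList("hi", "how", "you", "are");
        List<String> doc3 = Arrays.asList("hi", "how", "are", "you");
        System.out.println(totalShift(doc1, doc3)); // identical order
        System.out.println(totalShift(doc1, doc2)); // "are" and "you" swapped
    }
}
```

In this proxy the swapped pair in doc 2 costs 2 in total, which is in line with the PhraseQuery documentation's note that reversing two adjacent words needs a slop of at least 2.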
Regards,

Apache Lucene and text meaning

I have a question about the search process in Lucene.
I use this code to search:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I have set a field to "I want to go shopping", and in another document I have set it to "I wanna go shopping".
Both sentences mean the same thing!
Is there a good way for Lucene to understand the meaning of sentences, or to normalize them in some way? For example, I could store the field as "I wanna /want to/ go shopping" and remove the comment with a regexp in the results.
Lucene provides filters to normalize words and even to map similar words to each other.
PorterStemFilter -
Stemming reduces words to their roots.
E.g. wanted and wants are reduced to the root want, so a search for any of those words matches the document.
However, wanna does not reduce to the root want, so stemming may not work in this case.
SynonymFilter -
lets you map similar words to each other in a configuration file.
So wanna can be mapped to want, and a search for either of them will match the document.
You need to add these filters to your analysis chain.
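The effect of synonym mapping can be sketched in plain Java. This is not the SynonymFilter API, only an illustration of the normalization it performs. The class SynonymNormalizer and its one-entry synonym table are hypothetical, and here wanna is expanded to the two tokens want and to so that both example sentences end up with the same token list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SynonymNormalizer {
    // Hypothetical synonym table; a real setup would load it from configuration.
    private static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("wanna", Arrays.asList("want", "to"));
    }

    /** Lower-cases, splits on non-word characters, and expands synonyms. */
    static List<String> normalize(String sentence) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue; // leading punctuation yields an empty token
            out.addAll(SYNONYMS.getOrDefault(token, Collections.singletonList(token)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalize("I want to go shopping"));
        System.out.println(normalize("I wanna go shopping"));
    }
}
```

Both sentences normalize to the same tokens, so an AND query built from either one would match a document built from the other. In Lucene the same mapping would be applied by the analyzer at index time, query time, or both.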

How can I perform a fuzzy search for all words provided in a Lucene.net search

I am trying to teach myself Lucene.Net to implement on my site. I understand how to do almost everything I need except for one issue. I am trying to figure out how to allow a fuzzy search for all search terms in a search string.
So for example if I have a document with the string The big red fox, I am trying to get bag fix to match it.
The problem is, it seems like in order to perform fuzzy searches, I have to add ~ to every search term the user enters. I am unsure of the best way to go about this. Right now I am attempting this by
string queryString = "bag rad";
queryString = queryString.Replace("~", string.Empty).Replace(" ", "~ ") + "~";
The first replace is there because Lucene.Net throws an exception if the search string already contains a ~; apparently it cannot handle ~~ in a phrase. This method works, but it seems like it will get messy if I start adding fuzzy weight values.
Is there a better way to make all words fuzzy by default?
You might want to index your documents as bi-grams or tri-grams. Take a look at the CJKAnalyzer to see how it does this; you will want to download its source and read it.
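If you stay with the query-string approach from the question, the rewrite can be made a little more robust by handling each term separately. The sketch below is written in Java rather than C#, and the class FuzzyQueryStrings is hypothetical. It strips any existing ~, ignores repeated whitespace, and appends ~ to every term. It does not escape other query-parser special characters, which is assumed to be handled elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

final class FuzzyQueryStrings {
    /** Appends "~" to every whitespace-separated term of the user input. */
    static String fuzzify(String userInput) {
        List<String> terms = new ArrayList<>();
        for (String raw : userInput.trim().split("\\s+")) {
            String term = raw.replace("~", ""); // avoid "~~", which the parser rejects
            if (!term.isEmpty()) {
                terms.add(term + "~");
            }
        }
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        System.out.println(fuzzify("bag rad"));
    }
}
```

For the input bag rad this produces bag~ rad~, the same string as the Replace chain in the question. Because each term is handled in one place, a per-term fuzzy weight could later be appended there.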

How to get the search words location from Sphinx search engine?

I use Sphinx to index HTML pages, giving different weights to title, description, etc. I'm looking for a way to get the search words location in the page from the results that I get from Sphinx.
Meaning, if the wordset is "stack overflow program" and I have 5 documents that match, each of them was a match because it contained at least one word from the wordset.
The question is: how do I know where each word was found in a document? For example, I want to know if document 1 returned because it contained "overflow" in the title and "stack" in the description.
I see that each result comes back with a certain weight (3780, for example), but I cannot tell from that which word was found where.
Thanks a lot!
You'll have to (somehow) get the results back programmatically, and then you can call BuildExcerpts on the contents. Sphinx will then give you an HTML block with the relative positions of the found text.

How can I highlight the search terms with DB2 Text Search in 9.5.2?

I have a working search query with the new Text Search functions in DB2 9.5.2:
select *
from prevue.mysearchabletable mst
where contains(mst.searchabletext, 'porsche') = 1;
However, this only tells me which ones match, as well as what the score is (score function has been omitted for brevity). Is there a way to use the text search in 9.5.2 to highlight search terms?
I know there were functions for this in the old Net Search Extender, but I haven't found equivalents (or a workaround) in the new text search system yet.