Using Lucene, I want to compare a document in the index with the rest of the documents. I found out that an easy way would be to submit the document itself as a query. The problem is that I need to combine the terms with OR and, the most difficult part, boost each term by its term frequency.
I think that if I replace all the blank spaces in the document with ' OR ', Lucene will parse and interpret it. But is there a more sophisticated way to deal with this problem?
And what is the easiest way to boost the terms by their respective frequencies?
It looks like you are trying to re-implement Lucene's MoreLikeThis.
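For reference, a minimal sketch of what using it might look like. The field name "body" and the tuning values are assumptions, not part of your setup, and it presumes the field has term vectors (otherwise call setAnalyzer first); in recent Lucene versions the class lives in org.apache.lucene.queries.mlt.

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class SimilarDocs {
        // Build an OR query from the document's top terms, boosted by term frequency,
        // and run it against the same index.
        public static TopDocs findSimilar(IndexReader reader, int docId) throws IOException {
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[] { "body" }); // hypothetical field name
            mlt.setMinTermFreq(1);                      // keep even rare terms for this example
            mlt.setMinDocFreq(1);
            Query query = mlt.like(docId);              // the source document, by its internal id
            return new IndexSearcher(reader).search(query, 10);
        }
    }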
I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do it, because Lucene doesn't keep it in the index. So I thought that while indexing I could build a map from docID to document length and then use it during query evaluation. But I have no idea where I should put this map and where to update it.
You are exactly right; storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm, which is something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
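For illustration, a minimal sketch of reading the stored value back at search time, assuming the document length was encoded into the norm of a hypothetical field called "body":

    import java.io.IOException;

    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.NumericDocValues;

    public class NormLengthReader {
        // Returns the encoded value stored in the norm for one document;
        // decode it with your Similarity's decodeNormValue if you compressed it.
        public static long readEncodedLength(LeafReader reader, int docId) throws IOException {
            NumericDocValues norms = reader.getNormValues("body"); // hypothetical field name
            return norms == null ? 0 : norms.get(docId);
        }
    }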
I don't quite understand the difference between Apache Solr's spellcheck and fuzzy search functionality.
I understand that fuzzy search matches your search term with the indexed value based on some difference expressed in distance.
I also understand that spellcheck gives you suggestions based on how close your search term is to a value in the index.
So to me those two things don't seem that different, though I am sure this is due to my not understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
Thanks,
Bob
I'm not a Solr professional, but I'll try to explain.
Fuzzy search is a simple instruction for Solr to apply a kind of spellchecking during requests. Solr's standard query parser supports fuzzy search, and you can use it without any additional settings, for example: roam~ or roam~1. This so-called spellchecking uses the Damerau-Levenshtein (edit distance) algorithm.
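For illustration, at the Lucene level underneath Solr, roam~1 corresponds to a FuzzyQuery allowing one edit; a minimal sketch, with a made-up field name:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;

    // roam~1: match terms within 1 edit (Damerau-Levenshtein) of "roam";
    // "text" is a hypothetical field name.
    FuzzyQuery fuzzy = new FuzzyQuery(new Term("text", "roam"), 1);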
To use spellchecking you need to configure it in solrconfig.xml (please see here). This gives you some flexibility in how to implement spellchecking (there are a couple of out-of-the-box implementations); for example, you can use a separate index for the spellcheck and thereby reduce the load on the main index. Spellchecking also uses a different URL, /spell, so it is not part of the search query the way a fuzzy query is.
Which should you use, spellchecking or fuzzy search? I guess it depends on your server load, because fuzzy search is more expensive and is not recommended by the Solr team.
P.S. This is my understanding of fuzzy search and spellchecking, so if somebody has a more correct and clear explanation, please advise us on how to deal with them.
I have a SentimentAttribute class which extends AttributeImpl. I am also currently writing a SentenceSentimentTaggingFilter class which should
take an InputStream (consisting of text)
tokenize it into sentences
assign a sentiment to each sentence, i.e., by adding a SentimentAttribute to it
The problem I currently have is that Lucene only seems to provide functionality for tokenizing text into individual tokens, e.g., single words, but nothing to split text into sentences.
What is the best way to integrate this with the regular EnglishAnalyzer I'm also using during indexing? I would like to avoid running the EnglishAnalyzer and my analysis in parallel and instead hook my analysis in between the processing steps of the EnglishAnalyzer (assuming that this is the fastest / most efficient way).
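For context, here is a rough skeleton of the kind of filter I have in mind (assuming SentimentAttribute is exposed as an attribute interface with a matching Impl class, as Lucene's attribute factory expects, and that it has a setSentiment method; scoreSentiment is just a placeholder for the actual classifier):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class SentenceSentimentTaggingFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final SentimentAttribute sentimentAtt = addAttribute(SentimentAttribute.class);

        public SentenceSentimentTaggingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Placeholder: a real implementation would buffer a whole sentence
            // before scoring and then tag each of its tokens with the result.
            sentimentAtt.setSentiment(scoreSentiment(termAtt.toString()));
            return true;
        }

        private float scoreSentiment(String text) {
            return 0f; // hypothetical classifier call
        }
    }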
Thanks a lot in advance :)
I'm actually doing something very similar but in an earlier version of Lucene, V3.0.2. You may want to look at the following class:
org.apache.lucene.wordnet.AnalyzerUtil
You've probably found a way to do this by now, but I hope it helps anyway.
I have written a plugin in Lucene which annotates certain terms and stores their spans in this fashion: <term>,<span>;<term>,<span>;..
Now I need to handle span near queries using just these spans and not the default spans Lucene stores. This is because not all terms which are similar are annotated. So basically, if I query for terms within k tokens of each other, I should be able to get their span distance by subtracting their corresponding spans. How can I do this in Lucene? I'm a newbie, so please be as descriptive as possible.
Thanks,
Ananth.
A good general rule I follow in Lucene is to put specially-processed data into its own fields so there is little chance of a mix-up. In that way, you can perform your nearness queries in the way you want. (This will make your index bigger.)
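For illustration, a minimal sketch of that approach, with made-up field and term names: the annotated terms go into their own field at index time, and the proximity query runs only against that field.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class AnnotatedFieldExample {
        // Index time: keep the full text and the annotated terms in separate fields.
        public static Document buildDoc(String fullText, String annotatedTerms) {
            Document doc = new Document();
            doc.add(new TextField("body", fullText, Field.Store.NO));
            doc.add(new TextField("annotations", annotatedTerms, Field.Store.NO));
            return doc;
        }

        // Query time: find two annotated terms within k positions of each other.
        public static SpanQuery within(String term1, String term2, int k) {
            return new SpanNearQuery(
                new SpanQuery[] {
                    new SpanTermQuery(new Term("annotations", term1)),
                    new SpanTermQuery(new Term("annotations", term2))
                },
                k,      // maximum slop between the spans
                false); // order of the two terms doesn't matter
        }
    }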
Is there a way to query a full text index to help determine additional noise words? I would like to add some custom noise words and wondered if there's a way to analyse the index to help suggest them.
It's as simple as described in
http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/
which explains how to do it. Coming up with the proper words, though, is hard.
I decided to look into lucene.net because I wasn't happy with the relevance calculations in SQL Server full text indexing.
I managed to figure out how to index all the content pretty quickly and then used Luke to find noise words. I have now edited the SQL Server noise files based on this analysis. Now I have a search solution that works reasonably well using SQL Server full text indexing, but I plan to move to lucene.net in the future.
Using SQL Server full text indexing as a base, I developed a domain-centric approach to finding relevant content using tools I understood. After some serious thinking and testing, I used many measures to determine the relevance of a search result beyond what is provided by analysing text content for term frequency and word distance. SQL Server full text indexing gave me a great start, and now I have a strategy I can express in Lucene that will work very well.
It would have taken me a whole lot longer to understand Lucene and develop a search strategy otherwise. If anyone out there is still reading this: use full text indexing to test your idea, and then move to Lucene once you have a strategy you know will work for your domain.