Compute term vector without indexing in Lucene 4

I am migrating my code from Lucene 3.5 to Lucene 4.1, but I am having trouble getting the term vector without indexing.
The problem: given a text string and an Analyzer, I need to compute the term vector (that is, find the terms and their frequencies, tf). Obviously, this can be achieved by writing the index (using IndexWriter) and then reading the terms back (using IndexReader), but I reckon that would be expensive. Furthermore, I don't need the document frequency (df), so I think an indexing-free solution is suitable.
In Lucene 2 and 3, a simple technique for this purpose is to use QueryTermVector, which implements TermFreqVector and has a constructor taking a string and an Analyzer. Unfortunately, QueryTermVector (along with TermFreqVector) has been removed in Lucene 4, and the migration documentation does not seem to mention QueryTermVector at all.
Do you have a solution for this problem in Lucene 4? Thank you very much.

If you just need the terms and their frequencies, you can obtain the individual tokens directly from the analyzer and get the TF by counting them, e.g. with a Map or a Multiset.
This is how you do it in Lucene 4.0:
TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
ts.reset(); // must be called before the first incrementToken()
while (ts.incrementToken()) {
    String term = charTermAttribute.toString();
    // term contains your token; count it here to get the TF
}
ts.end();
ts.close();
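The counting step mentioned above can be sketched with a plain Map. This is only a minimal sketch: it assumes the tokens have already been collected from the analyzer into a list, and the class name TermFrequencies and the sample tokens are hypothetical, not part of Lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts how often each token occurs.
class TermFrequencies {

    // Returns a map from term to its frequency (tf), in first-seen order.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return tf;
    }

    public static void main(String[] args) {
        // Stand-in for the terms produced by the incrementToken() loop.
        List<String> tokens = Arrays.asList("lucene", "term", "vector", "lucene", "term", "lucene");
        System.out.println(count(tokens)); // {lucene=3, term=2, vector=1}
    }
}
```

In real code the merge call would sit inside the incrementToken() loop, so no intermediate list is needed.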

Related

Lucene substring matching

What is the best way to check if String a is part of String b in Lucene. For example: a = "capital" and b = "Berlin is a capital of Germany". In this case b contains a and fits the requirement.

I think your problem can be treated as checking whether a field contains a certain term.
A basic TermQuery should be enough to solve it. With most analyzers, "Berlin is a capital of Germany" will be analyzed into the terms "berlin", "capital" and "germany" (if you use the basic stop words).
// code in Scala
new TermQuery(new Term("contents", "capital"))
You can also use a PhraseQuery, although your problem is not the most suitable scenario for it.
val query = new PhraseQuery()
query.add(new Term("contents", "capital"))
Section 3.4, "Lucene's diverse queries", of Lucene in Action (2nd edition) introduces all the kinds of Query used in Lucene. I suggest you read it; it might help.
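To see why a term-level match works here, it can help to mimic what an analyzer does to the text. The following is a rough sketch in plain Java, not Lucene code: it lowercases, splits on non-alphanumeric characters and drops a small hypothetical stop-word list, which only approximates what a real analyzer does. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for an analyzer; not a Lucene class.
class SimpleAnalysis {

    // Hypothetical stop-word list, much smaller than a real analyzer's.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "a", "of", "the"));

    // Lowercases the text, splits it into tokens and removes stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // True if the analyzed text contains the given term as a whole token.
    static boolean containsTerm(String text, String term) {
        return analyze(text).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String b = "Berlin is a capital of Germany";
        System.out.println(analyze(b));                 // [berlin, capital, germany]
        System.out.println(containsTerm(b, "capital")); // true
        System.out.println(containsTerm(b, "capit"));   // false: only whole terms match
    }
}
```

The last line illustrates the limit of this approach: matching happens on whole terms, so a true partial-word substring would need a different query type.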

Using Lucene when indexing I boost certain documents, but their score at search is still 1

I am trying to boost certain documents, but they don't get boosted. Please tell me what I am missing. Thanks!
In my index code I have:
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(myDocument);
Then, in my search code, I retrieve the documents from the ScoreDocs object into the myDocuments collection and run:
foreach (Lucene.Net.Documents.Document doc in myDocuments)
{
    float tempboost = doc.GetBoost();
}
I placed a breakpoint in the foreach loop that breaks if tempboost is not 1, and the breakpoint is never hit.
What did I miss?
Many thanks!

From javadoc of Lucene (Java version but same behaviors apply):
public float getBoost()
Returns, at indexing time, the boost factor as
set by setBoost(float).
Note that once a document is indexed this value is no longer available
from the index. At search time, for retrieved documents, this method
always returns 1. This however does not mean that the boost value set
at indexing time was ignored - it was just combined with other
indexing time factors and stored elsewhere, for better indexing and
search performance.
Note: if you get NaN when retrieving the score, use the following line:
searcher.SetDefaultFieldSortScoring(true,true);

How to calculate "OnTopicness" of documents using Lucene.NET

Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums, that serve as my Lucene documents.
Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. In fact, this feature is not much more than a simple cosine similarity between two document vectors. It will be stored in the database and therefore has to be calculated only once per post:
Forum-OnTopicness: cosine similarity between my post and a virtual
document consisting of all other posts in the specified forum (including
all threads in the forum)
Thread-OnTopicness: cosine similarity between my post and a virtual
document consisting of all other posts in the specified thread
Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as query and search for the other document in the results or that I could manually calculate the similarity using TermFreqVectors and DocFrequencies.
I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't know if I just pass two documents to my GetCosineSimilarity method:
public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I create an index for them and somehow compare my post to this entire index?
As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!

Take a look at MoreLikeThisQuery in
https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/
Its source may be useful.

Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.
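The manual calculation mentioned in the question can be sketched independently of Lucene. The following is a minimal sketch, assuming each document (or "virtual document") has already been reduced to a map from term to raw term frequency; it does not weight terms by document frequency, and the class name and sample vectors are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Lucene.NET.
class CosineSimilarity {

    // Cosine of the angle between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector has no direction; treat as not similar
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> post = new HashMap<>();
        post.put("lucene", 2);
        post.put("index", 1);

        // Stand-in for the summed term frequencies of all other posts in a thread.
        Map<String, Integer> thread = new HashMap<>();
        thread.put("lucene", 1);
        thread.put("search", 1);

        System.out.println(cosine(post, thread));
    }
}
```

Under this representation, a "virtual document" for a thread or forum is simply the sum of the term-frequency maps of its posts, which avoids concatenating the text.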

Lucene synonym expansion,stemming,spell check and more

I am using Lucene to index my database and then perform a phrase search on a specific field(field name: keyword).
I am using following code currently:
String userQuery = request.getParameter("query");
//create standard analyzer object
analyzer = new StandardAnalyzer(Version.LUCENE_30);
Analyzer analyze=AnalyzerUtil.getPorterStemmerAnalyzer(analyzer);
//create File object of our index directory
File file = new File(LUCENE_INDEX_DIRECTORY);
//create index reader object
reader = IndexReader.open(FSDirectory.open(file),true);
//create index searcher object
searcher = new IndexSearcher(reader);
//create topscore document collector
collector = TopScoreDocCollector.create(1000, false);
//create query parser object
parser = new QueryParser(Version.LUCENE_30,"keyword", analyze);
parser.setAllowLeadingWildcard(true);
//parse the query and get reference to Query object
query = parser.parse(userQuery);
//********Line 1***********************
//search the query
searcher.search(query, collector);
hits = collector.topDocs().scoreDocs;
//check whether the search returns any result
if (hits.length > 0) {
    // code to retrieve hits
}
This code works fine for stemming, but now I also want to expand my query to do a synonym search: if I enter "Man" and my Lucene index has an entry "male", it should still return that entry as a hit.
I tried adding this at Line 1 in the above code:
query = SynExpand.expand(userQuery, searcher, analyze, "keyword", serialVersionUID);
But it doesn't give me any results.
I also want to introduce spell checking, so that if I enter "ubelievable" instead of "unbelievable" it still gives me a result.
I have no idea why synonym expansion isn't working for me or how to do spell checking. If someone could guide me, I would be really grateful.
Thanks!

A fuzzy search can be done with a query term modifier, namely by appending a tilde to the keyword:
keyword:ubelievable~
See Lucene Parser Syntax for more details and other types of queries that may be interesting to you.
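Fuzzy matching of this kind is based on edit distance between terms. The sketch below is a plain-Java illustration of the classic Levenshtein distance, not Lucene's own implementation; the class name is hypothetical. It shows why the misspelling from the question is a close match.

```java
// Hypothetical helper illustrating edit distance; not Lucene's implementation.
class EditDistance {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn s into t (classic Levenshtein distance).
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of t takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning the first i chars of s into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("ubelievable", "unbelievable")); // 1
    }
}
```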
There are two ways of dealing with synonyms. The query expansion you are trying to use relies on WordNet. As SynExpand's documentation says, you must first invoke Syns2Index before you can use expansion. This is the easy way, but it works only with English words.
If you need to support multiple languages or add your own synonyms, you can use synonym injection during indexing. The idea is to write your own analyzer that injects synonyms from your own dictionary into the indexed documents. This may sound hard to implement, but fortunately there is an excellent example in the Lucene in Action book. Its source code is available for free (see the lia.analysis.synonym package), though I highly recommend getting your own copy of this nice book.

Parse a search string (into NHibernate Criterias )

I would like to implement an advanced search for my project.
The search right now takes all the strings the user enters and builds one big disjunction with the Criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets ().
I am having a hard time parsing the string and building criteria from it. I found this Stack Overflow question, but it didn't really help (the asker didn't make it clear what he wanted).
I found another article, but this supports much more and spits out sql statements.
Another thing I've heard mention a lot is Lucene - but I'm not sure if this really would help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder which I couldn't find in NHibernate.
Any more ideas?

I guess you haven't heard about NHibernate Search until now.
NHibernate Search uses Lucene underneath and gives you all the options of using AND, OR and the rest of the Lucene query grammar.
All you have to do is attribute your entities for indexing, and NHibernate will index them at a predefined location.
After that you can search this index with the power that Lucene exposes and get your domain-level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor()))) {
    QueryParser qp = new QueryParser("id", new StopAnalyzer());
    IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
    IList result = NHQuery.List();
}
Powerful, isn't it?

What I am basically doing right now is parsing the input string with the Lucene.Net QueryParser API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
void Search(string queryString)
{
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser luceneParser = new QueryParser("name", analyzer);
    Query luceneQuery = luceneParser.Parse(queryString);
    // The normalized form looks like "+A +B C"
    string[] words = luceneQuery.ToString().Split(' ');
    foreach (string word in words)
    {
        // Parse each clause of the Lucene.Net query string
    }
}
After that I parse this string manually, creating the disjunctions and conjunctions.
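The manual step described above might look roughly like the following. This is only a sketch in plain Java under stated assumptions: it handles a flat, already normalized string such as "+A +B C" (optionally with a "field:" prefix on each clause) and does not handle brackets or quoted phrases. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a flat, normalized query string into clause groups.
class ClauseSplitter {
    final List<String> required = new ArrayList<>();   // "+term": all must match (conjunction)
    final List<String> optional = new ArrayList<>();   // "term": any may match (disjunction)
    final List<String> prohibited = new ArrayList<>(); // "-term": must not match

    static ClauseSplitter parse(String normalizedQuery) {
        ClauseSplitter result = new ClauseSplitter();
        for (String clause : normalizedQuery.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue; // happens only for an empty input string
            }
            if (clause.startsWith("+")) {
                result.required.add(stripField(clause.substring(1)));
            } else if (clause.startsWith("-")) {
                result.prohibited.add(stripField(clause.substring(1)));
            } else {
                result.optional.add(stripField(clause));
            }
        }
        return result;
    }

    // Drops a leading "field:" prefix if one is present.
    static String stripField(String clause) {
        int colon = clause.indexOf(':');
        return colon >= 0 ? clause.substring(colon + 1) : clause;
    }

    public static void main(String[] args) {
        ClauseSplitter parsed = parse("+name:A +name:B name:C");
        System.out.println("required: " + parsed.required); // required: [A, B]
        System.out.println("optional: " + parsed.optional); // optional: [C]
    }
}
```

The required list would then map to a conjunction and the optional list to a disjunction in the criteria being built. Nested groups would need a recursive parser, or a walk over the parsed query object instead of its string form.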
