Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums, which serve as my Lucene documents.
Now I am trying to calculate a feature called "OnTopicness" for each post, based on the terms used in it. In fact, this feature is not much more than a simple cosine similarity between two document vectors; it will be stored in the database and therefore has to be calculated only once per post:
Forum-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified forum (including all threads in the forum)
Thread-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified thread
Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as a query and search for the other document in the results, or manually calculate the similarity using TermFreqVectors and DocFrequencies.
I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, and I don't know how to obtain that if I just pass two Document objects to my GetCosineSimilarity method:
public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I just create an index for them and somehow compare my post to this entire index?
As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!
Take a look at MoreLikeThisQuery in
https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/
Its source may be useful.
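As a rough sketch of how it is typically used (Java API shown; the Lucene.NET port in that contrib folder mirrors the names closely, and the field name "PostBody" plus the reader/searcher/docNumber variables are assumptions, not from the question):
MoreLikeThis mlt = new MoreLikeThis(reader);        // reader = an open IndexReader
mlt.setFieldNames(new String[] { "PostBody" });     // field(s) to base the similarity on
Query likeThisPost = mlt.like(docNumber);           // docNumber = internal id of the post
TopDocs similarPosts = searcher.search(likeThisPost, 10);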
Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.
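If you do go down the manual TermFreqVector route from the question, a rough sketch of the missing pieces might look like this (Java API shown, the Lucene.NET names are very close; the stored, unique "PostId" field used to look up the internal doc number is an assumption, and PostBody must have been indexed with term vectors enabled):
// Look up the internal document number via a stored, unique id field.
int getDocNumber(IndexReader reader, String postId) throws IOException {
    TermDocs td = reader.termDocs(new Term("PostId", postId));
    return td.next() ? td.doc() : -1;
}

// Build a term -> frequency map from the stored term vector of the PostBody field.
Map<String, Integer> getTermFrequencies(IndexReader reader, int docNumber) throws IOException {
    TermFreqVector tfv = reader.getTermFreqVector(docNumber, "PostBody");
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    Map<String, Integer> vector = new HashMap<String, Integer>();
    for (int i = 0; i < terms.length; i++) {
        vector.put(terms[i], freqs[i]);
    }
    return vector;
}

// Classic cosine similarity: dot product over shared terms divided by the vector norms.
double cosineSimilarity(Map<String, Integer> v1, Map<String, Integer> v2) {
    double dot = 0, norm1 = 0, norm2 = 0;
    for (Map.Entry<String, Integer> e : v1.entrySet()) {
        Integer other = v2.get(e.getKey());
        if (other != null) {
            dot += e.getValue() * other;
        }
        norm1 += e.getValue() * e.getValue();
    }
    for (int f : v2.values()) {
        norm2 += (double) f * f;
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
}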
In Lucene 5.4, I have a Document and a Query object, and I want to score this query for this document using some similarity function (e.g. BM25).
How can I do this? The way I got the job done was looping over all results of a search and comparing documents with the document I want to evaluate.
To get scoring details for a particular document for a given query, you want to use IndexSearcher.explain(). This provides a lot of useful details about how the scoring algorithm operates. You can get the final score from the Explanation with Explanation.getValue() (at the root node, if you start navigating through with getDetails, those sub-nodes won't return the same value):
IndexSearcher searcher = new IndexSearcher(reader);
//Make sure you set the Similarity to the correct algorithm
searcher.setSimilarity(new BM25Similarity());
Explanation explain = searcher.explain(myQuery, myDocID);
float score = explain.getValue();
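If you only have the Document and not its internal docID yet, one way to get it (a sketch, assuming myQuery actually matches that document) is to take it from the search results:
TopDocs hits = searcher.search(myQuery, 10);
int myDocID = hits.scoreDocs[0].doc;  // the internal id that explain() expects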
I am migrating my code from Lucene 3.5 to Lucene 4.1 but I am having some problems with getting the term vector without indexing.
The problem is, given a text string and an Analyzer, I need to compute the term vector (technically, find the terms and their frequencies tf). Obviously, it can be achieved by writing the index (using IndexWriter) and then reading them back (using IndexReader) but I reckon it would be expensive. Furthermore, I don't need document frequency (df). Thus, I think an indexing-free solution is suitable.
In Lucene 2 and 3, a simple technique for the above purpose is to use QueryTermVector which extends TermFreqVector and has a constructor taking a string and an Analyzer. Unfortunately, QueryTermVector (along with TermFreqVector) has been removed in Lucene 4 and it seems the migration documentation did not mention anything about QueryTermVector.
Do you have a solution for this problem in Lucene 4? Thank you very much.
If you just need to know the terms/frequencies, you can just obtain the single tokens directly from the analyzer (you can get the TF by counting them, e.g. by using a Map or a Multiset).
This is how you do it in Lucene 4.0:
TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
ts.reset(); // in Lucene 4.x the stream must be reset before the first incrementToken()
while (ts.incrementToken()) {
    String term = charTermAttribute.toString();
    // term contains your token
}
ts.end();
ts.close();
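Pulled together into a small self-contained helper that returns the term frequencies (just a sketch; the method and variable names are mine, not part of any Lucene API):
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public static Map<String, Integer> getTermFrequencies(Analyzer analyzer, String field, String text) throws IOException {
    Map<String, Integer> termFreqs = new HashMap<String, Integer>();
    TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        Integer seen = termFreqs.get(term);
        termFreqs.put(term, seen == null ? 1 : seen + 1);
    }
    ts.end();
    ts.close();
    return termFreqs;
}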
Interview Question
I have been asked this question in an interview, and the answer doesn't have to be specific programming language, platform- or tool- specific.
The question was phrased as follows:
How would you get the instance count of a given word in a PDF? The answer doesn't have to be programming, platform, or tool specific. Just let me know how you would do it in a memory- and speed-efficient way.
I am posting this question for following reasons:
To better understand the context - I still fail to understand the context of this question; what might the interviewer be looking for by asking it?
To get diverse opinions - I tend to answer such questions based on my skills in a programming language (C#), but there might be other valid options to get this done.
Thanks for your interest.
If I had to write a program to do it, I'd find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words.
If this was a one-off task or something that needed to be automated for a non-production-quality task, I'd just feed the file into the pdftotext program and then parse the output file with Python, splitting it into words, putting them in a dictionary and counting the number of occurrences.
If I was asking this interview question, I'd be looking for a couple of things: understanding the difference in setting for this task (a one-off script vs. production code), and not attempting to implement a PDF renderer yourself but finding a library instead.
Now I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what PDF is and what a "word" is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Often, the words will be split into a couple of completely separate strings which are absolutely positioned in the document to make up a single word. This is why searching for words in a PDF document sometimes gives strange-looking results. So to implement word searching in a document you'd have to glue these strings back together (pdftotext takes care of that for you).
It's not a bad question at all.
You can use a trie. It makes it very easy to get the count of a given word.
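For example, a minimal sketch of a count-carrying trie (my own illustration, not from any particular library):
import java.util.HashMap;
import java.util.Map;

class WordCountTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<Character, Node>();
        int count; // number of times a word ending at this node was added
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            Node child = node.children.get(c);
            if (child == null) {
                child = new Node();
                node.children.put(c, child);
            }
            node = child;
        }
        node.count++;
    }

    public int count(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return 0; // the word never occurred
            }
        }
        return node.count;
    }
}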
I would suggest an open-source solution using Java. First you would have to parse the PDF file and extract all the text using Tika.
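For the extraction step, something along these lines should work (a sketch assuming the Tika facade class and its parser dependencies are on the classpath; the file name is a placeholder):
import java.io.File;
import org.apache.tika.Tika;

Tika tika = new Tika();
// parseToString throws IOException/TikaException, so wrap it accordingly
String extractedText = tika.parseToString(new File("document.pdf"));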
Then I believe the correct question is how to find the TF (term frequency) of a word in a text. I will not trouble you with definitions, because you can achieve this simply by scanning the extracted text and counting the frequency of the word.
Sample code would look like this:
// 'extractedText' is the text pulled out of the PDF (e.g. via Tika above)
Scanner scan = new Scanner(extractedText);
Map<String, Integer> listOfWords = new HashMap<String, Integer>();
while (scan.hasNext())
{
    String word = scan.next();
    if (!listOfWords.containsKey(word))
    {
        listOfWords.put(word, 1); // first occurrence of this word
    }
    else
    {
        // get the current count, increment it and put it back;
        // put() simply overwrites the old value for the key
        listOfWords.put(word, listOfWords.get(word) + 1);
    }
}
I am trying to boost certain documents, but they don't get boosted. Please tell me what I am missing. Thanks!
In my index code I have:
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(myDocument);
Then in my search code I retrieve the documents from the ScoreDocs object into the myDocuments collection and:
foreach (Lucene.Net.Documents.Document doc in myDocuments)
{
    float tempboost = doc.GetBoost();
}
and I place a breakpoint in the foreach clause, set to break if tempboost is not 1 - and the breakpoint is never hit.
What did I miss?
Many thanks!
From the Lucene javadoc (this is the Java version, but the same behavior applies):
public float getBoost()
Returns, at indexing time, the boost factor as set by setBoost(float).
Note that once a document is indexed this value is no longer available from the index. At search time, for retrieved documents, this method always returns 1. This however does not mean that the boost value set at indexing time was ignored - it was just combined with other indexing time factors and stored elsewhere, for better indexing and search performance.
Note: for those of you who get NaN when retrieving the score, please use the following line:
searcher.SetDefaultFieldSortScoring(true,true);
I have a long list of words that I put into a very simple SOLR / Lucene database. My goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (Damerau-)Levenshtein edit distance. I understand SOLR provides such a distance for spelling suggestions.
In my SOLR schema.xml, I have configured a string field type:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
which I use to define a field
<field name='term' type='string' indexed='true' stored='true' required='true'/>
I want to search this field and have results returned according to their Levenshtein edit distance. However, when I run a query like webspace~0.1 against SOLR with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:
"1582":"
1.1353534 = (MATCH) sum of:
1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
0.08618848 = queryWeight(term:webpage^0.8148148), product of:
0.8148148 = boost
13.172914 = idf(docFreq=1, maxDocs=386954)
0.008029869 = queryNorm
13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
1.0 = tf(termFreq(term:webpage)=1)
13.172914 = idf(docFreq=1, maxDocs=386954)
1.0 = fieldNorm(field=term, doc=1581)
Clearly, for my application, term frequencies, idfs and so on are meaningless, as each document only contains a single term. I tried to use the spelling suggestions component, but didn't manage to make it return the actual similarity scores.
Can anybody provide hints on how to configure SOLR to perform Levenshtein / Jaro-Winkler / n-gram searches with scores returned, and without additional factors like tf, idf and boost included? Is there a bare-bones configuration sample for SOLR somewhere? I find the number of options truly daunting.
If you're using a nightly build, then you can sort results based on Levenshtein distance using the strdist function:
q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc
More details here and here
Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance calculators, incl. Jaro-Winkler, Levenshtein, etc.
how to configure SOLR to perform Levenshtein / Jaro-Winkler / n-gram searches with scores returned and without additional factors like tf, idf and boost included?
You've got some suggestions for how to obtain the desired results, but none actually answers your question.
q={!func}strdist("webspace",term,edit) will overwrite the default document scoring with the Levenstein distance and q={!func}strdist("webspace",term,jw) does the same for Jaro-Winkler.
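For instance, a bare-bones request could look something like this (host, core and the fl list are placeholders; the term field comes from your schema):
http://localhost:8983/solr/select?q={!func}strdist("webspace",term,edit)&fl=term,score&rows=10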
The sorting suggested above will work fine in most cases, but it doesn't change the scoring function; it just sorts the results obtained with the scoring method you want to avoid. This might lead to different results, and the order of the groups might not be the same.
To see which one fits best, adding &debugQuery=true might be enough.