How to configure Solr / Lucene to perform Levenshtein edit distance searching?

I have a long list of words that I put into a very simple SOLR / Lucene database. My goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (Damerau) Levenshtein edit distance. I understand SOLR provides such a distance for spelling suggestions.
In my SOLR schema.xml, I have configured a field type string:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
which I use to define a field
<field name='term' type='string' indexed='true' stored='true' required='true'/>
I want to search this field and have results returned according to their Levenshtein edit distance. However, when I run a query like webspace~0.1 against SOLR with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:
"1582":"
1.1353534 = (MATCH) sum of:
1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
0.08618848 = queryWeight(term:webpage^0.8148148), product of:
0.8148148 = boost
13.172914 = idf(docFreq=1, maxDocs=386954)
0.008029869 = queryNorm
13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
1.0 = tf(termFreq(term:webpage)=1)
13.172914 = idf(docFreq=1, maxDocs=386954)
1.0 = fieldNorm(field=term, doc=1581)
Clearly, for my application, term frequencies, idfs and so on are meaningless, as each document only contains a single term. I tried to use the spelling suggestions component, but didn't manage to make it return the actual similarity scores.
Can anybody provide hints on how to configure SOLR to perform Levenshtein / Jaro-Winkler / n-gram searches with the scores returned, and without additional factors like tf, idf and boost included? Is there a bare-bones configuration sample for SOLR somewhere? I find the number of options truly daunting.

If you're using a nightly build, then you can sort results based on Levenshtein distance using the strdist function:
q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc
More details are in the Solr function query documentation.

Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance calculators, including Jaro-Winkler, Levenshtein, etc.
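If you would rather stay inside Lucene itself, its spell-checker module ships the same kind of calculators. A minimal sketch, assuming the spellchecker/suggest jar is on the classpath (the example strings are made up):
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.NGramDistance;
import org.apache.lucene.search.spell.StringDistance;

public class DistanceDemo {
    public static void main(String[] args) {
        // Note Lucene's own spelling of the Levenshtein class name.
        StringDistance levenshtein = new LevensteinDistance();
        StringDistance jaroWinkler = new JaroWinklerDistance();
        StringDistance ngram = new NGramDistance(2); // bigram-based

        // Each call returns a similarity in [0, 1]; 1.0 means the strings are identical.
        System.out.println(levenshtein.getDistance("webspace", "webpage"));
        System.out.println(jaroWinkler.getDistance("webspace", "webpage"));
        System.out.println(ngram.getDistance("webspace", "webpage"));
    }
}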

how to configure SOLR to perform Levenshtein / Jaro-Winkler / n-gram
searches with the scores returned, and without additional factors like
tf, idf and boost included?
You've got some solutions for how to obtain the desired results, but none actually answers your question.
q={!func}strdist("webspace",term,edit) will override the default document scoring with the Levenshtein distance, and q={!func}strdist("webspace",term,jw) does the same for Jaro-Winkler.
The sorting suggested above will work fine in most cases, but it doesn't change the scoring function; it just sorts the results obtained with the scoring method you want to avoid. This might lead to different results, and the order of the groups might not be the same.
To see which one fits best, adding &debugQuery=true to the request might be enough.
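As an illustration, the function-query form could be issued from SolrJ roughly like this; the core name "terms", the use of a SolrJ 6+ client, and the field names are assumptions, not something given in the question:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class StrDistQueryDemo {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/terms").build()) {
            // The function query replaces the tf/idf score with the edit-distance similarity.
            SolrQuery q = new SolrQuery("{!func}strdist(\"webspace\", term, edit)");
            q.setFields("term", "score"); // ask Solr to return the score explicitly
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("term") + " -> " + doc.getFieldValue("score"));
            }
        }
    }
}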

Related

Use MeCab to separate Japanese sentences into words, not morphemes, in VB.NET

I am using the following code to split Japanese sentences into words:
' Parse the sentence with MeCab and print each morpheme's surface form,
' its reading (feature field 7) and part-of-speech detail (feature field 1).
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
    If node.CharType > 0 Then
        Dim features = node.Feature.Split(","c)
        Console.Write(node.Surface)
        Console.WriteLine(" (" & features(7) & ") " & features(1))
    End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are some strategies that folks usually combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
I have an open-source package that combines all of the above called Curtiz: it runs text through MeCab, chunks the morphemes into bunsetsu with J.DepP to find groups that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my own Japanese study and Japanese learning tools, but it shows how the above pieces can be combined to get what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

How to index only words with a minimum length using Apache Lucene 5.3.1?

Could someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?
I've searched through the API but didn't find anything which suits my needs except this, but I couldn't figure out how to use that.
Thanks!
Edit:
I guess that's important info, so here's a copy of my explanation of what I want to achieve from my reply below:
"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way then one of those."
You should use a custom Analyzer with a LengthFilter (the "length" token filter). E.g.
Analyzer ana = CustomAnalyzer.builder()
    .withTokenizer("standard")
    .addTokenFilter("standard")
    .addTokenFilter("lowercase")
    .addTokenFilter("length", "min", "4", "max", "50")  // drop tokens shorter than 4 (or longer than 50) characters
    .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
    .build();
But it is better to use a stopword list (words that occur in almost all documents, like articles in English). This gives a more accurate result.
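For completeness, a sketch of wiring the analyzer above into an IndexWriter; the index path and field name are made up, and imports from java.nio.file.Paths, org.apache.lucene.index, org.apache.lucene.document and org.apache.lucene.store are assumed:
// Continuing from the CustomAnalyzer "ana" built above.
IndexWriterConfig config = new IndexWriterConfig(ana);
try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
    Document doc = new Document();
    doc.add(new TextField("body", "a db fix for the summarization tool", Field.Store.YES));
    writer.addDocument(doc); // only "summarization" and "tool" pass the 4-character minimum
}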

Search with term vector position in Lucene

Is it possible to search for document similarity based on term-vector position in Lucene?
For example there are three documents with content as follows
1: Hi how are you
2: Hi how you are
3: Hi how are you
Now if doc 1 is searched in Lucene, it should return doc 3 with a higher score and doc 2 with a lower score, because doc 2 has the words "you" and "are" at different positions.
In short, Lucene should return exactly matching documents based on term positions.
I think what you need is a PhraseQuery; it is a Lucene Query type that takes into account the precise positions of your tokens and allows you to define a slop, or permutation tolerance, for those tokens.
In other words, the more your tokens' positions differ from the source, the lower they will be scored.
You can use it like this:
QueryBuilder analyzedBuilder = new QueryBuilder(new MyAnalyzer());
Query query = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery); // createPhraseQuery returns a Query
The createPhraseQuery method also accepts a third parameter, the slop I alluded to, if you want to tweak it.
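A rough end-to-end sketch; the index path, field name and analyzer choice are assumptions:
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.QueryBuilder;

public class PhraseSearchDemo {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryBuilder builder = new QueryBuilder(new StandardAnalyzer());
            // A slop of 2 tolerates small re-orderings such as "Hi how you are",
            // but scores them lower than the exact order "Hi how are you".
            Query query = builder.createPhraseQuery("content", "Hi how are you", 2);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("content") + "  score=" + hit.score);
            }
        }
    }
}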
Regards,

Sitecore 7 + Lucene: Query-Time Boosting: how?

For items of a certain template, our users can indicate that the item should be shown at the top of the list.
For this, we have added a "ShowOnTop" field to the index.
Now, when searching for items of this template (to build the list page), we would like these "ShowOnTop" items to effectively be returned above the other items.
This field, however, should not affect other site search (general search).
We think this could be possible by applying Query-Time Boosting to these documents. But how can we achieve this?
To do boosting at query time, simply use the Boost(value) method (shown here with a search predicate, since it sounds like you might be doing some advanced searching where the added flexibility of predicates comes in handy):
var queryPredicate = PredicateBuilder.True<SearchResult>();
queryPredicate = queryPredicate.And(i =>
i.Headline.Contains(model.Query).Boost(50));
Probably the best way would be to apply a Sort based on that field, something along the lines of:
Sort sort = new Sort(new SortField("ShowOnTop", SortField.STRING, true), true);
var hits = new SearchHits(context.Searcher.Search(query, sort));
You could also add it as a heavily boosted optional query term and make the rest of the query required (as a whole), something along the lines of:
ShowOnTop:true^10000 +(the rest of the query)
With a large enough boost factor, those terms should always come up first unless there is a really drastic difference in relevance.
The easiest approach is to create a rule under /sitecore/system/Settings/Rules/Indexing and Search/... that filters on your ShowOnTop field (I used a checkbox and compared the value with 1) and adjusts the boost by 99999999.
You can either add this rule as a global rule, or add it as an item rule and assign it from within the item.
Good luck!

How to calculate "OnTopicness" of documents using Lucene.NET

Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums, that serve as my Lucene documents.
Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. In fact, this feature is not much more than a simple cosine similarity between two document vectors that will be stored in the database and therefore has to be calculated only once per post:
Forum-OnTopicness: cosine similarity between my post and a virtual
document consisting of all other posts in the specified forum (including
all threads in the forum)
Thread-OnTopicness: cosine similarity between my post and a virtual
document consisting of all other posts in the specified thread
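(For reference, by cosine similarity I mean the standard formula, with $p_i$ and $d_i$ being the term weights, e.g. tf-idf, of my post vector $\mathbf{p}$ and the virtual-document vector $\mathbf{d}$:
\cos(\mathbf{p}, \mathbf{d}) = \frac{\mathbf{p} \cdot \mathbf{d}}{\lVert \mathbf{p} \rVert \, \lVert \mathbf{d} \rVert} = \frac{\sum_i p_i d_i}{\sqrt{\sum_i p_i^2} \, \sqrt{\sum_i d_i^2}} )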
Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as a query and search for the other document in the results, or manually calculate the similarity using TermFreqVectors and DocFrequencies.
I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't have if I just pass two Document objects to my GetCosineSimilarity method:
public void GetCosineSimilarity(Document doc1, Document doc2)
{
using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
{
// how do I get the docNumbers?
TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
...
// assuming that I have the TermFreqVectors, how would I continue here?
}
}
Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I just create an index for them and somehow compare my post to this entire index?
As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!
Take a look at MoreLikeThisQuery in
https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/
Its source may be useful.
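For orientation, driving MoreLikeThis from code looks roughly like this in Java Lucene (the Lucene.NET contrib port mirrors the API closely); the field name and document number are placeholders, and in recent Java versions the class lives in org.apache.lucene.queries.mlt:
// 'reader' is assumed to be an open IndexReader over the posts index.
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "PostBody" });
mlt.setMinTermFreq(1); // consider every term of the source post
mlt.setMinDocFreq(1);  // even terms that appear in only one document
Query likePost = mlt.like(42); // 42 = internal docNumber of the source post
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs similar = searcher.search(likePost, 10); // hits are scored by tf-idf similarity to the source post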
Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.