Lucene TF-IDF does not have the square of idf? - lucene

In the Lucene 8.2.0 source code, in TFIDFSimilarity.TFIDFScorer#score(float freq, long norm),
I cannot find the square of idf, but according to the documentation (the Lucene Practical Scoring Function),
the square of idf should be involved when calculating the score.
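For reference, the formula from that documentation page, as I read it, is roughly:
score(q,d) = coord(q,d) * queryNorm(q) * sum over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
and the documentation explains the square by idf(t) appearing once for the query and once for the document.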
This mismatch really confuses me: does the documentation not match the source code, or have I just misunderstood the source code? Could someone explain it, please? Thanks in advance.

Related

Sample size calculation for PKPD Modelling

I am trying to find code for a sample size calculation for a PKPD modelling analysis. I found an article that seems to answer what I need, but it does not include the code: 10.1007/s10928-005-0078-3. Can anyone help me, please?
So far I have only found the reference. I am expecting to be able to calculate a sample size for a PKPD model study.

Evaluation in continuous optimization problems

I'm trying to understand continuous optimization algorithms applied to some test functions.
Here are the results obtained by some algorithms used for this problem on some of the test functions:
[results table shown as an image in the original post]
I didn't understand the difference between the two underlined phrases. Would you please help me with this?
P.S. Sometimes they use the term "median number" instead of "mean number". What's the difference between the two?
This question lacks some context. It would have been better to link to the text too, to get a grasp of what is going on.
But I read it as follows (and I think that's how someone with some experience in optimization algorithms would read it; you have to check it against your knowledge of the context):
The bold 1.0s are the normalized numbers of function evaluations on the different functions to optimize (each row is a different function).
The values in brackets are the unnormalized numbers expressing the same thing.
So if ACO used 820 evaluations (unnormalized), which is normalized to 1.0, then CACO used 8.3 * 820 evaluations.
The mean and median are two different measures of central tendency. Check Wikipedia to understand the differences.
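To make that concrete, here is a minimal Java sketch (the evaluation counts are made-up numbers, not taken from the table above): the mean is the arithmetic average over runs, while the median is the middle value of the sorted runs, so a single very slow run pulls the mean up but barely moves the median.

import java.util.Arrays;

public class CentralTendency {
    public static void main(String[] args) {
        // Hypothetical numbers of function evaluations over 5 independent runs.
        double[] evals = {800, 810, 820, 830, 5000}; // one outlier run

        // Mean: sum of all values divided by the number of runs.
        double mean = Arrays.stream(evals).average().orElse(Double.NaN);

        // Median: middle value of the sorted runs (average of the two middle
        // values when the number of runs is even).
        double[] sorted = evals.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        System.out.printf("mean = %.1f, median = %.1f%n", mean, median);
        // Prints mean = 1652.0, median = 820.0 -> the outlier dominates only the mean.
    }
}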

Pairwise cosine similarity

I'm a little confused when I read this paper: Pairwise Document Similarity in Large Collections with MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
In this paper, the author does not seem to consider terms that appear in only one document, but according to the definition of cosine similarity, we need to consider that situation, right?
The material I used is this: https://www.dropbox.com/s/nctb66hh84ab32c/postings-Reuters-data
The java code I used is this: https://www.dropbox.com/s/aklviixup4uulmu/CosineSimilarity.java
And the results I generated is this: https://www.dropbox.com/s/ea6ov7l7yut7yfj/part-00000
In the results, I see a lot of 1's and even numbers bigger than 1. That seems kind of weird; could someone help me find out the reason? Thanks.
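For reference, cosine similarity is the dot product of the two term-weight vectors divided by the product of their Euclidean norms, so for non-negative weights it always lies between 0 and 1; scores greater than 1 usually mean the dot product was never divided by the two vector magnitudes. Also, a term that appears in only one document cannot contribute to the dot product between two different documents (there is no matching posting), which may be why the paper ignores that case in the pairwise phase, although such a term still counts toward its own document's norm. Below is a minimal sketch of the textbook definition (not the code from the Dropbox link; the example documents are made up):

import java.util.HashMap;
import java.util.Map;

public class Cosine {
    // Cosine similarity of two sparse term-weight vectors (term -> weight).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // terms missing from either document contribute 0
            }
        }
        return dot / (norm(a) * norm(b)); // divide by both magnitudes, otherwise scores can exceed 1
    }

    static double norm(Map<String, Double> v) {
        double s = 0.0;
        for (double w : v.values()) {
            s += w * w;
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = new HashMap<>();
        d1.put("oil", 3.0); d1.put("price", 2.0); d1.put("opec", 1.0);
        Map<String, Double> d2 = new HashMap<>();
        d2.put("oil", 1.0); d2.put("price", 4.0);
        System.out.println(cosine(d1, d2)); // about 0.71, never above 1.0
    }
}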

Lucene: Setting minimum required similarity on searches

I'm having a lot of trouble dealing with Lucene's similarity factor. I want it to apply a similarity factor different from its default (which is 0.5 according to the documentation), but it doesn't seem to be working.
When I type a query that explicitly sets the required similarity factor, like [tinberland~0.5] (notice that I wrote tiNberland with an "N", while the correct spelling uses an "M"), it brings back many products by the Timberland manufacturer. But when I just type [tinberland] (no similarity factor explicitly defined) and try to set the similarity via code, it doesn't work (returns no results).
The code I wrote to set the similarity is like:
multiFieldQueryParser.SetFuzzyMinSim(0.5F);
And I didn't change the Similarity algorithm, so it is using the DefaultSimilarity class.
Isn't that the correct or recommended way of applying similarity via code? Is there a specific QueryParser for fuzzy queries?
Any help is highly appreciated.
Thanks in advance!
What you are setting is the minimal similarity, so e.g. if someone searched for foo~.1 the parser would change it to foo~.5. It's not saying "turn every query into a fuzzy query."
You can use MultiFieldQueryParser.getFuzzyQuery like so:
Query q = parser.getFuzzyQuery(field, term, minSimilarity);
but that will of course require calling getFuzzyQuery for each field. I'm not aware of a "MultiFieldFuzzyQueryParser" class, but all it would do is combine a bunch of those getFuzzyQuery calls, as sketched below.
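Here is a minimal sketch of that combination, assuming the older Lucene API in use here (a FuzzyQuery constructor that still takes a float minimum similarity); the field names and the query text are made up for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public class MultiFieldFuzzy {
    // Build one fuzzy clause per field and OR them together, which is roughly
    // what a hypothetical "MultiFieldFuzzyQueryParser" would have to do.
    static Query fuzzyOverFields(String[] fields, String text, float minSimilarity) {
        BooleanQuery combined = new BooleanQuery();
        for (String field : fields) {
            Query clause = new FuzzyQuery(new Term(field, text), minSimilarity);
            combined.add(clause, BooleanClause.Occur.SHOULD); // a match in any field is enough
        }
        return combined;
    }

    public static void main(String[] args) {
        Query q = fuzzyOverFields(new String[] {"name", "manufacturer"}, "tinberland", 0.5f);
        System.out.println(q); // e.g. name:tinberland~0.5 manufacturer:tinberland~0.5
    }
}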

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character substring that differs from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.
Q-grams could be one approach, but there are others such as BLAST and BLASTP, which are used for protein and nucleotide matches, etc.
The Simmetrics library is a comprehensive collection of string distance approaches.
Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing (LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.
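One way to make the prune-then-score idea concrete is q-gram filtering: compare cheap sets of overlapping character q-grams first, and only run the expensive Levenshtein computation on candidates that share enough q-grams with the query. The sketch below is generic Java, not tied to any of the tools mentioned above; the q-gram size and the pruning threshold are arbitrary illustrative choices.

import java.util.HashSet;
import java.util.Set;

public class QGramFilter {
    // Overlapping character q-grams of a string.
    static Set<String> qgrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + q <= s.length(); i++) {
            grams.add(s.substring(i, i + q));
        }
        return grams;
    }

    // Cheap filter: fraction of the query's q-grams that also occur in the candidate.
    static double sharedFraction(String query, String candidate, int q) {
        Set<String> qg = qgrams(query, q);
        if (qg.isEmpty()) return 0.0;
        Set<String> cg = qgrams(candidate, q);
        int shared = 0;
        for (String g : qg) {
            if (cg.contains(g)) shared++;
        }
        return (double) shared / qg.size();
    }

    // Standard dynamic-programming Levenshtein distance, only run on survivors.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "approximate substring search";
        String[] candidates = {"aproximate subtsring search", "completely unrelated text here"};
        for (String c : candidates) {
            double f = sharedFraction(query, c, 3); // 3-grams as a cheap signal
            if (f > 0.3) {                          // arbitrary pruning threshold
                System.out.println(c + " -> edit distance " + levenshtein(query, c));
            } else {
                System.out.println(c + " -> pruned (shared 3-gram fraction " + f + ")");
            }
        }
    }
}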