How does Lucene compute multifield score?

Here's Lucene's scoring equation:
score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
What about multifield scoring?
Do the scores get summed directly, or averaged, or something else?

You can read the details of scoring in the Similarity class. In this equation, the parameters are described with reference to the Document when they actually apply per Field. So, term frequency is the frequency of the term in a given field of the document. This automatically takes care of queries over multiple fields.
KenE's answer above is incorrect. (There is no MAX operator in the equation.) The scores of the per-field clauses add up to the final score. For the query (name:bill OR gender:male), the result is the sum of the scores for (name:bill) and (gender:male). Typically, the documents that satisfy both criteria score higher (because of the sum) and rank at the top.
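To make the summation concrete, here is a minimal numeric sketch for (name:bill OR gender:male); the clause weights are made up, and queryNorm and coord (discussed in a later answer) are ignored:

# Toy illustration with invented weights: each clause is scored against its
# own field, and the clause scores are summed per document.
clause_scores = {
    "name:bill": 1.2,    # tf * idf^2 * boost * norm for the "name" field
    "gender:male": 0.4,  # tf * idf^2 * boost * norm for the "gender" field
}

doc_matching_both = sum(clause_scores.values())        # ~1.6
doc_matching_name_only = clause_scores["name:bill"]    # 1.2

print(doc_matching_both, doc_matching_name_only)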

It depends on the operation. If you are doing an OR as in (name:bill OR gender:male), it takes the max of the two. If you are doing an AND, it will do a sum.

Shashikant Kore is correct to say that the scores for each field are summed. This, however, is only true before the contribution of the queryNorm and coord factors, so the final scores will likely not add up exactly.
Each score is multiplied by the queryNorm factor, which is calculated per query and hence differs between (name:bill), (gender:male), and (name:bill OR gender:male). Nor is the queryNorm of the combined query simply the sum of the queryNorms of the two single-term queries. So the scores only sum up if you first divide each score by the queryNorm factor of its own query.
The coord factor may also play a part: the default scorer multiplies the score by the proportion of query terms that were matched. So you can only rely on the summation (after accounting for queryNorm) when all terms match, or when coord is disabled.
You can see exactly how a score is calculated using the explain functionality, available in Solr through the debugQuery=true parameter.
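For example, a sketch of pulling that explanation out of Solr over HTTP; the host, port, and the "people" collection name are assumptions for illustration, and the field names are just the ones from this question:

# Sketch: fetch Solr's per-document score breakdown via debugQuery=true.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "name:bill OR gender:male",
    "debugQuery": "true",
    "wt": "json",
})
url = "http://localhost:8983/solr/people/select?" + params  # assumed host/collection

with urllib.request.urlopen(url) as resp:
    body = json.load(resp)

# With debugQuery=true, the explanation text lives under debug/explain,
# keyed by each matching document's unique key.
for doc_id, explanation in body["debug"]["explain"].items():
    print(doc_id, explanation)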

Using Lucene's default similarity, I used a boolean query and arrived at the following formula (sorry, it is in LaTeX):
score(q, d) = \sum_{f \in fields} \sum_{t \in q} queryNorm(q) \cdot idf(t, f)^2 \cdot tf(t, d, f) \cdot fieldNorm(f)
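For concreteness, a toy evaluation of that double sum in Python; all statistics are invented, and the only Lucene-specific detail assumed is that the default similarity's tf is the square root of the raw term frequency:

import math

# Toy statistics, all invented. The idf appears twice because the score is
# the product of a query weight (queryNorm * idf) and a field weight
# (sqrt(raw tf) * idf * fieldNorm).
idf = {("bill", "name"): 2.0, ("male", "gender"): 1.1}
raw_tf = {("bill", "name"): 1, ("male", "gender"): 1}
field_norm = {"name": 0.5, "gender": 1.0}
query_norm = 0.3  # one value shared by all clauses of the query

score = 0.0
for (term, field), idf_value in idf.items():
    query_weight = query_norm * idf_value
    field_weight = math.sqrt(raw_tf[(term, field)]) * idf_value * field_norm[field]
    score += query_weight * field_weight

print(round(score, 4))  # ~0.963 with these toy numbers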

Related

Performing a sparse sum on Mathematica

I want to evaluate a sum in Mathematica of the form
g[[i,j,k,l,m,n]] x g[[o,p,q,r,s,t]] x ( complicated function of the indices )
But all these indices range from 0 to 3, so the total number of cases to sum over is 4^12, which will take an unforgiving amount of time. However, barely any elements of the array g[[i,j,k,l,m,n]] are nonzero -- there are probably around 8 nonzero entries -- so I would like to restrict the sum over {i,j,k,l,m,n,o,p,q,r,s,t} to precisely those combinations of indices for which both factors of g are nonzero.
I can't find a way to do this for summation over multiple indices, where the allowed index choices are particular combinations of {i,j,k,l,m,n} as opposed to specific values of each particular index. Any help appreciated!
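Not a Mathematica answer, but here is a sketch of the idea in Python/NumPy: collect the nonzero index tuples of g once, then sum only over pairs drawn from that list, which is roughly 8 × 8 terms instead of 4^12. The array contents and the placeholder f below are stand-ins for the real problem.

import itertools
import numpy as np

# Toy stand-in: shape (4,)*6 with ~8 nonzero entries, like the real g.
rng = np.random.default_rng(0)
g = np.zeros((4,) * 6)
for _ in range(8):
    g[tuple(rng.integers(0, 4, size=6))] = rng.normal()

def f(idx1, idx2):
    # Placeholder for the "complicated function of the indices".
    return 1.0

# Enumerate the nonzero index tuples once ...
nonzero = list(zip(*np.nonzero(g)))

# ... then sum only over pairs of nonzero entries.
total = sum(g[i] * g[j] * f(i, j) for i, j in itertools.product(nonzero, nonzero))
print(total)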

How To Calculate Exact 99.9th Percentile in Splunk

Does anyone know how to exactly calculate the 99.9th percentile in Splunk?
I have tried a variety of methods as below, such as exactperc (but this only takes integer percentiles) and perc (but this approximates the result heavily).
base | stats exactperc99(latency) as "99th Percentile", p99.9(latency) as "99.9th Percentile"
Thanks,
James
From the Splunk documentation:
There are three different percentile functions:
perc<X>(Y) (or the abbreviation p<X>(Y))
upperperc<X>(Y)
exactperc<X>(Y)
Returns the X-th percentile value of the numeric field Y. Valid values of X are floating point numbers from 1 to 99, such as 99.95.
Use the perc<X>(Y) function to calculate an approximate threshold, such that of the values in field Y, X percent fall below the threshold.
The perc and upperperc functions give approximate values for the integer percentile requested. The approximation algorithm that is used, which is based on dynamic compression of a radix tree, provides a strict bound of the actual value for any percentile. The perc function returns a single number that represents the lower end of that range. The upperperc function gives the approximate upper bound. The exactperc function provides the exact value, but will be very expensive for high cardinality fields. The exactperc function could consume a large amount of memory in the search head.
Processes field values as strings.
Examples:
p99.999(response_ms)
p99(bytes_received)
p50(salary) # median
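Note that, per the documentation quoted above, exactperc<X> appears to accept fractional X, so exactperc99.9(latency) may work directly in Splunk. If you want to sanity-check the result outside Splunk, here is a sketch in Python/NumPy; the CSV export (e.g. via | outputcsv) and its layout are assumptions:

import numpy as np

# Hypothetical export of the latency field, one numeric value per line.
latency = np.loadtxt("latency.csv")

# 99.9th percentile computed over all raw values (linear interpolation
# between the two nearest order statistics), to compare against p99.9(latency).
print(np.percentile(latency, 99.9))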

SAGE implementation of discrete logarithm in subgroup of group of units

This is a question related to this one. Briefly, in an ElGamal cryptosystem whose underlying group is the group of units modulo a prime p, I'm told to find a subgroup of index 2 and solve the discrete logarithm problem there in order to break the system.
Clearly, as the group of units modulo a prime is cyclic, if x is a generator then x^2 generates a subgroup of index 2. Now, what is a good way of solving the discrete logarithm problem in Sage? And how would I use the result of solving the discrete logarithm problem in this subgroup to solve it in the whole group?
Sage knows how to compute discrete logarithms in finite fields:
sage: K = GF(19)
sage: z = K.primitive_element()
sage: a = K.random_element()
sage: b = a.log(z)
sage: z^b == a
True
You can use this functionality to solve the discrete logarithm in the subgroup of index 2:
sage: x = z^2
sage: a = K.random_element()^2
sage: a.log(x)
6
This is only a toy example, but note that this is not more efficient than solving the discrete logarithm in the full group 𝔽₁₉*.
It is true that the efficiency of generic algorithms (e.g., Baby step-Giant step, Pollard rho, ...) is directly related to the size of the subgroup; however algorithms used to solve discrete logarithms in finite fields (number field sieve, function field sieve) are mostly insensitive to the size of the multiplicative subgroup, and are in general much more efficient than generic algorithms.
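For the second part of the question, here is a quick numeric check (plain Python rather than Sage; 2 happens to be a primitive root mod 19, and the brute-force searches are only acceptable at this toy size) of how a logarithm found in the index-2 subgroup lifts back to the full group when the target is a square:

p = 19
z = 2                      # a primitive root mod 19, so <z> is the full unit group
x = pow(z, 2, p)           # generator of the index-2 subgroup (the squares)

a = pow(z, 12, p)          # a square: a = z^12 = x^6

# Brute-force discrete logs (fine only for this toy modulus).
log_x = next(k for k in range(9) if pow(x, k, p) == a)    # log in the subgroup
log_z = next(k for k in range(18) if pow(z, k, p) == a)   # log in the full group

# Since a = x^k = z^(2k), the full-group log is 2*k modulo the group order 18.
assert log_z == (2 * log_x) % 18
print(log_x, log_z)  # 6 12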

How to calculate average velocity for different acceleration?

I want to calculate the average speed over the distance traveled using GPS signals.
Does this formula calculate the correct average speed?
avgspeed = totalspeed/count
where count is the number of GPS signals.
If it is wrong, can anyone please tell me the correct formula?
While that should work, remember that GPS signals can be confused easily if you're in diverse terrain. Therefore, I would not use an arithmetic mean, but compute the median, so outliers (quick jumps) would not have such a big effect on the result.
From Wikipedia (n being the number of signals):
If n is odd, then Median (M) = value of the ((n + 1)/2)-th item.
If n is even, then Median (M) = [value of the (n/2)-th item + value of the (n/2 + 1)-th item] / 2.
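A small sketch of the outlier point; the speed values are invented, and Python's statistics.median implements exactly the odd/even rule quoted above:

import statistics

# Invented per-fix speeds in km/h; the 90.0 simulates a GPS "jump".
speeds = [42.0, 44.5, 43.0, 90.0, 41.5, 42.5]

mean_speed = sum(speeds) / len(speeds)    # pulled up by the single outlier
median_speed = statistics.median(speeds)  # robust to the single jump

print(round(mean_speed, 1), median_speed)  # 50.6 42.75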

Lucene SweetSpotSimilarity lengthNorm

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/misc/SweetSpotSimilarity.html
Implemented as: 1/sqrt( steepness * (abs(x-min) + abs(x-max) - (max-min)) + 1 ) .
This degrades to 1/sqrt(x) when min and max are both 1 and steepness is 0.5
Can anyone explain this formula for me? How is steepness decided, and what exactly is x referring to?
Any help is appreciated.
With the DefaultSimilarity, the shorter the field in terms of number of tokens, the higher the score.
e.g. if you have two docs, with indexed field values of "the quick brown fox" and "brown fox", respectively, the latter would score higher in a query for "fox".
SweetSpotSimilarity lets you define a "sweet spot" for the length of a field, as a range defined by min and max. Field lengths within the range score equally, and field lengths outside the range score lower, depending on how far the length is from the range boundary. "steepness" determines how quickly the score degrades as a function of that distance.
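To see the roles of min, max, and steepness, here is a sketch of the javadoc formula in Python; the parameter values below are arbitrary:

import math

def sweet_spot_length_norm(length, min_len, max_len, steepness):
    # 1/sqrt( steepness * (|x - min| + |x - max| - (max - min)) + 1 )
    return 1.0 / math.sqrt(
        steepness * (abs(length - min_len) + abs(length - max_len) - (max_len - min_len)) + 1.0
    )

# Inside the [min, max] sweet spot the norm is flat at 1.0 ...
print(sweet_spot_length_norm(5, min_len=3, max_len=10, steepness=0.5))   # 1.0

# ... and outside it the norm falls off; larger steepness means a faster fall-off.
print(sweet_spot_length_norm(20, min_len=3, max_len=10, steepness=0.5))  # ~0.30
print(sweet_spot_length_norm(20, min_len=3, max_len=10, steepness=2.0))  # ~0.16

# With min == max == 1 and steepness == 0.5 it degrades to 1/sqrt(length).
print(sweet_spot_length_norm(4, min_len=1, max_len=1, steepness=0.5))    # 0.5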