Lucene SweetSpotSimilarity lengthNorm - lucene

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/misc/SweetSpotSimilarity.html
Implemented as: 1/sqrt( steepness * (abs(x-min) + abs(x-max) - (max-min)) + 1 ) .
This degrades to 1/sqrt(x) when min and max are both 1 and steepness is 0.5
Can anyone explain this formula to me? How is steepness decided, and what is it exactly referring to?
Any help is appreciated.

With the DefaultSimilarity, the shorter the field in terms of number of tokens, the higher the score.
e.g. if you have two docs, with indexed field values of "the quick brown fox" and "brown fox", respectively, the latter would score higher in a query for "fox".
SweetSpotSimilarity lets you define a "sweet spot" for the length of a field in terms of a range defined by min and max. Field lengths within the range will score equally, and field lengths outside the range will score lower, depending on the distance the length is from the range boundary. "steepness" determines how quickly the score degrades as a function of distance.
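A small Python sketch of that lengthNorm formula may make the behaviour concrete (the min = 3, max = 5, steepness = 0.5 values below are arbitrary example values, not Lucene defaults):

```python
from math import sqrt

def sweet_spot_length_norm(x, min_len, max_len, steepness):
    # 1 / sqrt( steepness * (|x - min| + |x - max| - (max - min)) + 1 )
    return 1.0 / sqrt(steepness * (abs(x - min_len) + abs(x - max_len)
                                   - (max_len - min_len)) + 1.0)

# Every field length inside the sweet spot [3, 5] gets the same (maximal) norm of 1.0;
# outside the range the norm falls off at a rate controlled by steepness.
for length in range(1, 10):
    print(length, round(sweet_spot_length_norm(length, 3, 5, 0.5), 3))

# With min = max = 1 and steepness = 0.5 the expression reduces to 1/sqrt(x),
# which is the degenerate case mentioned in the question.
```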

Related

How do I calculate the sum efficiently?

Given an integer n such that (1<=n<=10^18)
We need to calculate f(1)+f(2)+f(3)+f(4)+....+f(n).
f(x) is given as :-
Say, x = 1112222333,
then f(x)=1002000300.
Whenever we see a contiguous run of the same digit, we keep the first digit of the run and replace the rest of the run with zeroes.
Formally, f(x) = sum over all runs of (first digit of the run * 10^i), where i is the index of that run's first (leftmost) digit. We follow zero-based indexing, counted from the right :-)
f(x) = 1*10^9 + 2*10^6 + 3*10^2 = 1002000300.
In x = 1112222333, the element at index 9 is 1, and so on...
For x = 1234: the element at index 0 is 4, at index 1 is 3, at index 2 is 2, and at index 3 is 1.
How to calculate f(1)+f(2)+f(3)+....+f(n)?
I want to generate an algorithm which calculates this sum efficiently.
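For reference, here is a direct (and deliberately naive) Python translation of the definition of f above, just to make the examples concrete:

```python
def f(x: int) -> int:
    """Keep the first digit of each run of equal digits; zero out the rest."""
    s = str(x)
    result = 0
    for i, d in enumerate(s):
        if i == 0 or d != s[i - 1]:                # first digit of a run survives
            result += int(d) * 10 ** (len(s) - 1 - i)
    return result

assert f(1112222333) == 1002000300
assert f(1234) == 1234
```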
There is nothing to calculate.
Multiplying each digit by its positional power of ten will yield the same number.
So all you want to do is end up with 0s in the positions of the repeated digits.
I.e., let's populate some static values in an array, in pseudocode:
```
$As[1]='0'
$As[2]='00'
$As[3]='000'
...etc
$As[18]='000000000000000000'
```
These are the "results" of 10^index.
Given a value n of `1234`,
`1&000 + 2&00 + 3&0 + 4`
(where & is string concatenation) results in `1234`.
So, if you are putting this on a chip, then probably your most efficient method is to do a bitwise XOR between each register and the next one up the line as a single operation.
Then you will have 0s in all the spots you care about, and you just retrieve the values from the registers that hold a 1.
In code, I think it would be most efficient to do the following:
```
$n = 11223334    # arbitrary value
$x = $n * 10
$zeros = ($x - $n) / 10
```
Okay yeah we can just do bit shifting to get a value like 100200300400 etc.
To approach this problem, it could help to begin with one digit numbers and see what sum you get.
I mean like this:
Let's say we define F(k) to be f(1) + f(2) + ... + f(10^k - 1), i.e. the sum over all numbers with at most k digits. Then we have:
F(1) = 45                 # = 9*10/2, the arithmetic series 1 + 2 + ... + 9
F(2) = F(1)*9 + F(1)*100
F(1)*9 is the part that comes from the last digit: for each of the 10 possible digits in the first position, we have 9 possible digits in the last, because the two can't be equal, so one out of ten becomes zero. F(1)*100 comes from the leading digit, which is multiplied by 100 (10 because we add the second digit, and another factor of 10 because we get that digit ten times in that position).
If you now continue with this scheme, for k>=1 in general you get
F(k+1)= F(k)*100+10^(k-1)*45*9
The rest is probably straightforward.
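If I am reading the recurrence right (with F(k) taken to be f(1) + f(2) + ... + f(10^k - 1), as defined above), a quick brute-force check in Python seems to confirm it for small k:

```python
def f(x: int) -> int:
    # Keep the first digit of each run of equal digits; zero out the rest.
    s = str(x)
    return sum(int(d) * 10 ** (len(s) - 1 - i)
               for i, d in enumerate(s)
               if i == 0 or d != s[i - 1])

def F_bruteforce(k: int) -> int:
    # Sum of f(x) over all numbers with at most k digits.
    return sum(f(x) for x in range(1, 10 ** k))

def F_recurrence(k: int) -> int:
    F = 45                                    # F(1) = 1 + 2 + ... + 9
    for j in range(1, k):
        F = F * 100 + 45 * 9 * 10 ** (j - 1)  # F(j+1) = F(j)*100 + 10^(j-1)*45*9
    return F

for k in range(1, 5):
    assert F_bruteforce(k) == F_recurrence(k)
```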
Can you tell me which HackerRank task this is? I guess it's one of the Project Euler tasks, right?

How To Calculate Exact 99.9th Percentile in Splunk

Does anyone know how to exactly calculate the 99.9th percentile in Splunk?
I have tried a variety of methods as below, such as exactperc (but this only takes integer percentiles) and perc (but this approximates the result heavily).
base | stats exactperc99(latency) as "99th Percentile", p99.9(latency) as "99.9th Percentile"
Thanks,
James
From the Splunk documentation:
There are three different percentile functions:
perc<X>(Y) (or the abbreviation p<X>(Y)), upperperc<X>(Y), and exactperc<X>(Y).
Each returns the X-th percentile value of the numeric field Y. Valid values of X are floating point numbers from 1 to 99, such as 99.95.
Use the perc<X>(Y) function to calculate an approximate threshold, such that of the values in field Y, X percent fall below the threshold.
The perc and upperperc functions give approximate values for the integer percentile requested. The approximation algorithm that is used, which is based on dynamic compression of a radix tree, provides a strict bound of the actual value for any percentile. The perc function returns a single number that represents the lower end of that range. The upperperc function gives the approximate upper bound. The exactperc function provides the exact value, but will be very expensive for high cardinality fields. The exactperc function could consume a large amount of memory in the search head.
Processes field values as strings.
Examples:
p99.999(response_ms)
p99(bytes_received)
p50(salary) # median
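If you want to sanity-check Splunk's result, the exact 99.9th percentile of an exported latency column can be computed outside Splunk in a few lines. This sketch assumes you have dumped the latency values to a one-value-per-line file (the filename is just a placeholder) and uses numpy, which computes percentiles exactly for in-memory data:

```python
import numpy as np

# Hypothetical export of the latency field, one value per line.
latencies = np.loadtxt("latencies.txt")

# numpy interpolates linearly between the two nearest ranks,
# so this is an exact percentile of the loaded data.
p999 = np.percentile(latencies, 99.9)
print(f"exact 99.9th percentile: {p999:.3f}")
```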

How to calculate average velocity for different acceleration?

I want to calculate the average speed over the distance traveled using GPS signals.
Does this formula calculate the correct average speed?
avgspeed = totalspeed / count
where count is the number of GPS signals.
If it is wrong, can anyone please tell me the correct formula.
While that should work, remember that GPS signals can be confused easily if you're in diverse terrain. Therefore, I would not use an arithmetic mean, but compute the median, so outliers (quick jumps) would not have such a big effect on the result.
From Wikipedia (n being the number of signals):
If n is odd, then the median M = the value of the ((n + 1)/2)-th item.
If n is even, then the median M = [the value of the (n/2)-th item + the value of the (n/2 + 1)-th item] / 2.
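For illustration, a short Python sketch comparing the arithmetic mean with the median over a made-up list of per-fix GPS speeds (the outlier stands in for a spurious jump in the signal):

```python
import statistics

# Speeds (e.g. in km/h) reported with each GPS fix; example values only.
speeds = [42.0, 43.5, 41.8, 44.2, 120.0, 42.9]   # 120.0 is a spurious jump

mean_speed = sum(speeds) / len(speeds)            # skewed upward by the outlier
median_speed = statistics.median(speeds)          # barely affected by it

print(f"mean:   {mean_speed:.1f}")
print(f"median: {median_speed:.1f}")
```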

Weighted random letter in Objective-C

I need a simple way to randomly select a letter from the alphabet, weighted on the percentage I want it to come up. For example, I want the letter 'E' to come up in the random function 5.9% of the time, but I only want 'Z' to come up 0.3% of the time (and so on, based on the average occurrence of each letter in the alphabet). Any suggestions? The only way I see is to populate an array with, say, 10000 letters (590 'E's, 3 'Z's, and so on) and then randomly select a letter from that array, but it seems memory intensive and clumsy.
Not sure if this would work, but it seems like it might do the trick:
Take your list of letters and frequencies and sort them from smallest frequency to largest.
Create a 26-element array where each element n contains the sum of all previous weights and element n from the list of frequencies. Make note of the sum in the last element of the array.
Generate a random number between 0 and the sum you made note of above.
Do a binary search of the array of sums until you reach the element where that number would fall.
That's a little hard to follow, so it would be something like this:
if you have a 5 letter alphabet with these frequencies, a = 5%, b = 20%, c = 10%, d = 40%, e = 25%, sort them by frequency: a,c,b,e,d
Keep a running sum of the elements: 5, 15, 35, 60, 100
Generate a random number between 0 and 100. Say it came out 22.
Do a binary search for the element where 22 would fall. In this case it would be between element 2 and 3, which would be the letter "b" (rounding up is what you want here, I think)
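A minimal Python sketch of this approach, using the five-letter example above (cumulative sums plus the bisect module for the binary search):

```python
import bisect
import random

# Example alphabet sorted by frequency, as above: a=5%, c=10%, b=20%, e=25%, d=40%.
letters = ["a", "c", "b", "e", "d"]
weights = [5, 10, 20, 25, 40]

# Running sums: 5, 15, 35, 60, 100.
cumulative = []
total = 0
for w in weights:
    total += w
    cumulative.append(total)

def weighted_random_letter():
    r = random.uniform(0, total)              # random number between 0 and the total
    i = bisect.bisect_left(cumulative, r)     # binary search for the containing interval
    return letters[i]

print(weighted_random_letter())
```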
You've already acknowledged the tradeoff between space and speed, so I won't get into that.
If you can calculate the frequency of each letter a priori, then you can pre-generate an array (or dynamically create and fill an array once) to scale up with your desired level of precision.
Since you used percentages with a single digit of precision after the decimal point, then consider an array of 1000 entries. Each index represents one tenth of one percent of frequency. So you'd have letter[0] to letter[82] equal to 'a', letter[83] to letter[97] equal to 'b', and so on up until letter[999] equal to 'z'. (Values according to Relative frequencies of letters in the English language)
Now generate a random number between 0 and 1 (using whatever favourite PRNG you have, assuming uniform distribution) and multiply the result by 1000. That gives you the index into your array, and your weighted-random letter.
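A rough Python version of the same table idea (the counts below are in tenths of a percent and form only a partial, illustrative table, not the full English frequency list):

```python
import random

# Tenths of a percent per letter: 83 ~ 8.3% for 'a', 15 ~ 1.5% for 'b', ...
# A real table would cover all 26 letters and sum to 1000.
counts = {"a": 83, "b": 15, "c": 28}

# Pre-generate the lookup table once: one slot per 0.1% of frequency.
table = []
for letter, count in counts.items():
    table.extend([letter] * count)

def weighted_random_letter():
    # A uniform random index into the table yields letters with the desired weights.
    return table[random.randrange(len(table))]

print(weighted_random_letter())
```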
Use the method explained here. Alas this is for Python but could be rewritten for C etc.
https://stackoverflow.com/a/4113400/129202
First you need to make an NSDictionary of the letters and their frequencies.
I'll explain it with an example:
Let's say your dictionary is something like this:
@{@"a": @0.2, @"b": @0.5, @"c": @0.3};
So the frequencies of your letters cover the interval [0, 1] this way:
a->[0, 0.2] + b->[0.2, 0.7] + c->[0.7, 1]
You generate a random number between 0 and 1. Then, by checking which interval this random number belongs to and returning the corresponding letter, you get what you want.
You seed the random function at the beginning of your program: srand48(time(0));
-(NSString *)weightedRandomForDicLetters:(NSDictionary *)letterFreq
{
    double randomNumber = drand48();          // uniform random value in [0, 1)
    double endOfInterval = 0;
    for (NSString *letter in letterFreq) {
        // Accumulate frequencies to find the end of this letter's interval.
        endOfInterval += [[letterFreq objectForKey:letter] doubleValue];
        if (randomNumber < endOfInterval) {
            return letter;
        }
    }
    return nil;  // only reached if the frequencies sum to less than 1
}

How does Lucene compute multifield score?

Here's Lucene's scoring equation:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
What about multifield scoring?
Do the scores get directly summed, or averaged, or...?
You can read the details of scoring in the Similarity class. In this equation, the parameters are described in terms of the Document when they actually mean the Field. So, the term frequency is the frequency of the term in a given field of the document. This automatically takes care of queries on multiple fields.
KenE's answer above is incorrect. (There is no MAX operator in the equation.) The score for each query on a field adds up to the final score. For the query (name:bill OR gender:male) the result is sum of score for (name:bill) and (gender:male). Typically, the documents which satisfy both these criteria will score higher (due to sum) and come up.
It depends on the operation. If you are doing an OR as in (name:bill OR gender:male), it takes the max of the two. If you are doing an AND, it will do a sum.
Shashikant Kore is correct to say that scores for each field are summed. This, however, is only true before the contribution of the queryNorm and coord factors, meaning the final scores will not likely add up.
Each score is multiplied by the queryNorm factor, which is calculated per query and hence differs for each of (name:bill), (gender:male), and (name:bill OR gender:male). Nor is the queryNorm for the combined query merely the sum of the queryNorms for the two single-term queries. So the scores only sum if you divide each score by the queryNorm factor for that query.
The coord factor may also play a part: the default scorer multiplies the score by the proportion of query terms that were matched. So you can only rely on summation after accounting for queryNorm where all terms match (or coord is disabled).
You can see exactly how a score is calculated using the explain functionality, available in Solr through the debugQuery=true parameter.
Using Lucene's default similarity, I used a boolean query and arrived at the following final formula (sorry, it is in LaTeX):
score(q, d) = \sum_{f \in \text{fields}} \sum_{t \in \text{query}} queryNorm(q) \cdot idf(t, f)^2 \cdot tf(t, d, f) \cdot fieldNorm(f)