What is Weaviate's scoring approach? (cosine similarity)

I observed that Weaviate's score isn't the same as cosine similarity. I'd appreciate any resources on Weaviate's scoring approach.

cosine_sim = 2*certainty - 1. Also see this link in Weaviate's FAQ.
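If you want to apply the conversion programmatically, here is a minimal Python sketch of the formula above (the only assumption is that certainty is the value Weaviate returns in [0, 1]):

```python
def certainty_to_cosine(certainty: float) -> float:
    # Weaviate's certainty lives in [0, 1]; cosine similarity lives in [-1, 1].
    return 2.0 * certainty - 1.0

def cosine_to_certainty(cosine_sim: float) -> float:
    return (cosine_sim + 1.0) / 2.0

print(certainty_to_cosine(0.95))  # 0.9
```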

Related

Population size in Fast Messy Genetic Algorithm

I'm trying to implement the Fast Messy GA using the paper by Goldberg, Deb, Kargupta, and Harik: fmGA - Rapid, Accurate Optimization of Difficult Problems Using Fast Messy Genetic Algorithms.
I'm stuck on the formula for the initial population size that accounts for the Building Block evaluation noise:
The sub-functions here are m=10 order-3 (k=3) deceptive functions:
l=30, l'=27, and B is the signal-to-noise ratio, which is the ratio of the fitness deviation to the difference between the best and second-best fitness values (30-28=2). The fitness deviation, according to the table above, is sqrt(155).
However, in the paper they say that using 10 order-3 subfunctions with this equation should give a population size of 3,331, but after substituting I can't reach it, since I am not sure what the value of c(alpha) is.
Any help will be appreciated. Thank you
I think I've figured out what exactly c(alpha) is. At least the graph of it drawn against alpha looks exactly the same as in the paper. It seems that by the square of the ordinate they mean the square of the z-score found from the inverse normal distribution using alpha as the right-tail area. At first I was misled into thinking that after finding the z-score it should be substituted into the normal distribution density to find the height (the ordinate).
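To make that concrete, here is a small Python sketch of this interpretation (it assumes SciPy is available; c(alpha) below is simply the squared z-score for a right-tail area of alpha):

```python
from scipy.stats import norm

def c(alpha: float) -> float:
    # Squared z-score whose right-tail area under the standard normal is alpha.
    z = norm.ppf(1.0 - alpha)
    return z * z

print(c(0.01))  # z is about 2.326, so c(alpha) is about 5.41
```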
There is an implementation in Lua here https://github.com/xenomeno/GA-Messy for interested folks. However, the Fast Messy GA has some problems reproducing the figures from Goldberg's original paper which I am not sure how to fix, but that is another matter.

RandomForest decision confidence?

I'm using accord.net's RandomForestLearning on some data and have it predicting results correctly. What I'd really like is a way to look at the decision confidence that goes along with the plain classification results.
In the end I manually computed the confidence by summing the votes for each label from the component DecisionTrees and then dividing the maximal vote count by the total number of votes. It would be nice if there were an official way, though.
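For illustration, here is a framework-agnostic Python sketch of that manual vote-counting approach (this is not an Accord.NET API, just the same idea expressed over a list of per-tree predictions):

```python
from collections import Counter

def vote_confidence(tree_predictions):
    # tree_predictions: one predicted label per component decision tree.
    votes = Counter(tree_predictions)
    label, count = votes.most_common(1)[0]
    return label, count / len(tree_predictions)

print(vote_confidence([1, 1, 0, 1, 2, 1, 1, 0]))  # (1, 0.625)
```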

How to combine TF-IDF with edit distance or Jaro-winkler distance

I am looking for ways to improve the accuracy of the TF-IDF weighting scheme in string matching (similarity). The main issue is that TF-IDF is sensitive to typographical errors in strings, and most large datasets tend to have typos.
I realised that variants of edit distance (character-based similarity metrics such as Levenshtein, affine gap, Jaro, and Jaro-Winkler) are suitable for computing similarity between strings with typographical errors, but not when words are out of order.
Hence I would like to use edit distance's tolerance for typos to enhance the accuracy of TF-IDF.
Any ideas on how to address this challenge will be highly appreciated.
Thanks in advance.
There is a paper published by CMU researchers in 2003 in which they explain how to combine TF-IDF with Jaro-Winkler:
https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf
Their Java code is also available on SourceForge as the secondString project:
https://sourceforge.net/projects/secondstring/
Here is a link to Javadocs:
http://secondstring.sourceforge.net/javadoc/
The secondString project page:
http://secondstring.sourceforge.net/

Weighted Bipartite Matching covering one partition

I have a problem that I managed to reduce to a weighted bipartite matching problem. Basically, I have a bipartite graph with partitions A and B, and a set of edges with weights. In my case, |A| ~= 20 and |B| = 300.
I want to find a set of edges which minimizes the total weight AND covers A (each vertex in A is incident to an edge in the solution).
Questions:
-Is there a special name for this kind of problem, so I can look for algorithms and solutions?
-I know I can reduce it to a weighted bipartite perfect matching by adding dummy vertices to A with infinite weight, but I'm worried about practical performance since |B| >> |A|.
-Any suggestions on Java libraries? I found this: http://algs4.cs.princeton.edu/code/. I think 'AssignmentProblem.java' is almost what I need (but I guess it doesn't ensure a perfect matching?).
Thanks in advance, and sorry about the bad English.
a) maximum weighted perfect matching
b) ???
c) the Floyd-Warshall algorithm is your friend
I've found a C implementation on the web, and you can also use Edmonds' blossom algorithm.
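Not a Java library, but for reference: the "cover A only" version maps directly onto the rectangular assignment problem, which SciPy solves out of the box. Here is a minimal Python sketch of that reduction (the weights and the BIG sentinel for missing edges are made-up example values):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e9  # stands in for "no edge": large but finite so the solver stays stable

# Toy instance with |A| = 3 rows and |B| = 5 columns (made-up weights).
cost = np.full((3, 5), BIG)
cost[0, [0, 2]] = [4.0, 1.5]
cost[1, [1, 2]] = [2.0, 3.0]
cost[2, [3, 4]] = [5.0, 0.5]

# On a rectangular matrix this matches every row (vertex of A) to a distinct
# column (vertex of B) while minimising the total weight, i.e. it covers A.
rows, cols = linear_sum_assignment(cost)
for a, b in zip(rows, cols):
    print(f"A[{a}] -> B[{b}]  weight {cost[a, b]}")
print("total weight:", cost[rows, cols].sum())
```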

Lucene. How to build a term-doc matrix

I need to build that matrix but I can't find a way to compute normalized tf-idf for each cell.
The normalization I would perform is cosine normalization, that is, divide each tf-idf value (computed using DefaultSimilarity) by sqrt(sum of squared tf-idf values in the column).
Does anyone know a way to perform that?
Thanks in advance
Antonio
One way, not using Lucene, is described in Sujit Pal's blog. Alternatively, you can build a Lucene index that has term vectors per field, iterate over the terms to get idf, then iterate over each term's documents to get tf.
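If it helps, here is a small Python sketch of the normalization step alone. It assumes you have already pulled the raw term frequencies and document frequencies out of the index (e.g. via term vectors), and the tf-idf formula mimics the classic DefaultSimilarity weighting rather than calling any Lucene API:

```python
import math
from collections import defaultdict

# Raw counts, assumed already extracted from the index (e.g. via term vectors):
# tf[term][doc] = raw term frequency, df[term] = number of docs containing term.
tf = {"apple": {0: 3, 2: 1}, "banana": {0: 1, 1: 2}, "cherry": {1: 1, 2: 4}}
n_docs = 3
df = {t: len(docs) for t, docs in tf.items()}

# tf-idf in the style of Lucene's classic DefaultSimilarity:
# sqrt(tf) * (log(numDocs / (docFreq + 1)) + 1)
weight = defaultdict(dict)
for t, docs in tf.items():
    idf = math.log(n_docs / (df[t] + 1)) + 1.0
    for d, f in docs.items():
        weight[t][d] = math.sqrt(f) * idf

# Cosine-normalise each column (document): divide by sqrt(sum of squared tf-idf).
col_norm = defaultdict(float)
for t, docs in weight.items():
    for d, w in docs.items():
        col_norm[d] += w * w
col_norm = {d: math.sqrt(s) or 1.0 for d, s in col_norm.items()}

term_doc_matrix = {t: {d: w / col_norm[d] for d, w in docs.items()}
                   for t, docs in weight.items()}
print(term_doc_matrix)
```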