Relevance and Similarity Computation in Apache Lucene 7.5.x?

What is the difference between TFIDFSimilarity, DefaultSimilarity, and SweetSpotSimilarity in Lucene 7.5.1?
How can we implement BM25F in Lucene?

TFIDFSimilarity - An abstract base class for TF-IDF similarities; a fairly straightforward tf-idf implementation. The exact algorithm is well documented: TFIDFSimilarity
DefaultSimilarity - Not a thing anymore. Deprecated in 5.0, removed in 6.0.
ClassicSimilarity - The old default similarity. An implementation of TFIDFSimilarity. Adds baseline calculations for tf, idf, length norms and encoding/decoding of norms, etc.
SweetSpotSimilarity - An alternate implementation of TFIDFSimilarity. It extends ClassicSimilarity and primarily changes how length norms are calculated.
BM25Similarity - The current default similarity implementation. Implementation of Okapi BM25.
As for BM25F, I'm not aware of an out-of-the-box implementation. You'll likely want to modify BM25Similarity to suit that purpose. This article may be helpful: BM25F in Lucene with BlendedTermQuery.
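In case it helps, here is a minimal sketch (assuming Lucene 7.x) of where a custom Similarity is plugged in. The field name and per-field parameters are made-up values, and this is not BM25F itself, only the place where a BM25F-style customization would be wired in:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

public class SimilaritySetup {

    // Crude per-field tuning: different BM25 parameters per field. Real BM25F
    // blends per-field statistics into a single score, so treat this only as
    // a starting point, not a BM25F implementation.
    static final Similarity PER_FIELD = new PerFieldSimilarityWrapper() {
        @Override
        public Similarity get(String fieldName) {
            if ("title".equals(fieldName)) {
                return new BM25Similarity(1.2f, 0.3f);  // weaker length normalization
            }
            return new BM25Similarity();                 // Lucene's defaults (k1=1.2, b=0.75)
        }
    };

    // The same Similarity must be used at index time...
    static IndexWriterConfig writerConfig(Analyzer analyzer) {
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setSimilarity(PER_FIELD);
        return config;
    }

    // ...and at search time.
    static void configure(IndexSearcher searcher) {
        searcher.setSimilarity(PER_FIELD);
    }
}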

Related

Advantages and drawbacks to implementing core methods of a scripting language in the underlying language

Background: I am writing a scripting language interpreter as a way to test out some experimental language ideas. I am at the point of writing the core set of standard methods (functions) for built-in types. Some of these methods need to directly interface with the underlying data structures and must be written in the underlying language (Haskell in my case, but that is not important for this question). Others can be written in the scripting language itself if I choose.
Question: What are the advantages and drawbacks to implementing core library functions in either the underlying language or in the language itself?
Example: My language includes as a built-in type Arrays that work just like you think they do -- ordered data grouped together. An Array instance (this is an OO language) has methods inject, map and each. I have implemented inject in Haskell. I could write map and each in Haskell as well, or I could write them in my language using inject. For example:
def map(fn)
  inject([]) do |acc,val|
    acc << fn(val)
  #end inject
#end def map

def each(fn)
  inject(nil) do |acc,val|
    fn val
  #end inject
#end def each
I would like to know what the advantages and drawbacks are to each choice.
The main advantage is that you're eating your own dog food. You get to write more code in your language, and hence get a better idea of what it's like, at least for generic library code. This is not only a good opportunity to notice deficiencies in the language design, but also in the implementation. In particular, you'll find more bugs and you'll find out whether abstractions like these can be implemented efficiently or whether there is a fundamental barrier that forces one to write performance-sensitive code in another language.
This leads, of course, to one disadvantage: It may be slower, either in programmer time or run time performance. However, the first is valuable experience for you, the language designer, and the second should be incentive to optimize the implementation (assuming you care about performance) rather than working around the problem — it weakens your language and doesn't solve the same problem for other users who can't modify the implementation.
There are also advantages for future-proofing the implementation: the language should remain stable in the face of major modifications under the hood, so code written in the language itself needs less rewriting when those happen. Such functions will also behave more like other user-defined functions: when a standard library function or type is instead defined in the implementation language, there is a real risk that subtle differences sneak in and make the type or function behave in a way that can't be emulated by the language.

How to communicate between a Lucene analyzer and a TokenFilter

Is there a recommended way to communicate between an Analyzer and a TokenStream/TokenFilter in Lucene? I'd like to adjust analyzer behavior, in particular to override Analyzer.getPositionIncrementGap() with a context-specific method (e.g. the gap between tokens should depend on which tokens are being analyzed and which tokens have been analyzed so far; the task probably has other possible solutions, but here I want to understand the principle of such communication).
I've inspected an analyzer instance in the debugger, and it seems there are no "direct gates" from the Analyzer or its subclasses to the attribute storage used by the current Tokenizer and/or TokenFilter chain. Am I right?
Certainly I could create a custom Analyzer subclass and save a reference to one of the TokenFilters created in Analyzer.createComponents() as a property of the Analyzer, and then access the attribute storage through that property, but that seems to be quite an ugly solution, doesn't it?
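For illustration, here is a rough sketch (assuming Lucene 7.x) of that workaround. GapTrackingFilter, suggestedGap() and the gap rule are hypothetical names invented for this example, and the back-reference is only safe for single-threaded use because the Analyzer caches its TokenStreamComponents per thread:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class GapAwareAnalyzer extends Analyzer {

    // Hypothetical filter that accumulates state about the tokens seen so far.
    static final class GapTrackingFilter extends TokenFilter {
        private int tokensSeen = 0;

        GapTrackingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            tokensSeen++; // whatever state the gap should depend on goes here
            return true;
        }

        int suggestedGap() {
            return tokensSeen > 100 ? 10 : 1; // arbitrary illustrative rule
        }
    }

    private GapTrackingFilter lastFilter; // the "ugly" back-reference

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        GapTrackingFilter filter = new GapTrackingFilter(source);
        lastFilter = filter; // fragile: components are cached and reused per thread
        return new TokenStreamComponents(source, filter);
    }

    @Override
    public int getPositionIncrementGap(String fieldName) {
        return lastFilter == null ? 0 : lastFilter.suggestedGap();
    }
}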

How can we implement Strategy Pattern using AspectJ

Can I implement the Strategy pattern using AOP? I would like to either:
1. override the default algorithm, or
2. dynamically select any one of the given algorithms.
Thanks,
Look at "AspectJ Cookbook" by Russell Miles. It provides implementation of almost all classical design patterns from the point of AspectJ's view. Here is direct link to strategy pattern http://books.google.com/books?id=AKuBlJGl7iUC&lpg=PP1&pg=PA230#v=onepage&q&f=true.

Can someone point me to examples of multiparadigm (object-functional) programming in F#?

Can someone point me to examples of multiparadigm (object-functional) programming in F#?
I am specifically looking for examples which combine OO and functional programming. There's been a lot of talk about how F# is a hybrid language, but I've not been able to find examples that actually demonstrate multiparadigm programming.
Thanks
I made a small (600-line) Tetris clone in F# that is object-oriented, using XNA. The code is old (uses #light) and isn't the prettiest you will ever see, but it's definitely a mix of OOP and functional style. It consists of ten classes. I don't think I pass any first-class functions around, but it's a good example of F#'s functional power for programming in the small.
MyGame - Inherits XNA main game class and is the programs entry point.
Board - Keeps track of pieces that are no longer moving and horizontal line completes.
UI - The UI only has two states (playing and main menu) handled by bool stateMenu
Tetris - Handles game state. Game over and piece collision.
Piece - Defines the different Tetris shapes and their movement and drawing.
Player - Handles user input.
Shape - The base graphic object that maps to a Primative.
Primative - Wraps the Vertex primitive type.
I made a rough class diagram to help. If you have any questions about it feel free to ask in the comment section.
There are two ways of combining the functional and object-oriented paradigm. To some extent, they are independent and you can write immutable (functional) code that is structured using types (written as F# objects). An F# example of a Client type written in this style would be:
// Functional 'Client' class
type Client(name, income) =
    // Members are immutable
    member x.Name = name
    member x.Income = income
    // Returns a new instance
    member x.WithIncome(ninc) =
        new Client(name, ninc)
    member x.Report() =
        printfn "%s %d" name income
Note that the WithIncome method (which "changes" the income of the client) doesn't actually do any modifications - it follows the functional style and creates a new, updated, client and returns it as the result.
On the other hand, in F# you can also write object-oriented code that has mutable public properties, but uses some immutable data structure under the cover. This may be useful when you have some nice functional code and want to expose it to C# programmers in a traditional (imperative/object-oriented) way:
type ClientList() =
    // The list itself is immutable, but the private
    // field of the ClientList type can change
    let mutable clients = []
    // Imperative object-oriented method
    member x.Add(name, income) =
        clients <- (new Client(name, income)) :: clients
    // Purely functional - filtering of clients
    member x.Filter f =
        clients |> List.filter f
(The example is taken from the source code of Chapter 9 of my book. There are some more examples of immutable object-oriented code, for example in parallel simulation in Chapter 14).
The most powerful experience I've had mixing OO (specifically mutation) and functional programming is achieving performance gains by using mutable data structures internally while giving external users all the benefits of immutability. A great example is an implementation I wrote of an algorithm which yields lexicographical permutations, which you can find here. The algorithm I use is imperative at its core (repeated mutation steps of an array) and would suffer if implemented with a functional data structure. By taking an input array, making a read-only copy of it initially so the input is not corrupted, and then yielding read-only copies of it in the sequence expression after the mutation steps of the algorithm are performed, we strike a fine balance between OO and functional techniques. The linked answer references the original C++ implementation and benchmarks against the other, purely functional implementations posted as answers. The performance of my mixed OO/functional implementation falls in between the superior performance of the OO C++ solution and the purely functional F# solution.
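To make the shape of that balance concrete outside F#, here is a small Java sketch of the same idea (not the linked implementation): the permutation algorithm mutates a private working array, but the caller's array is never touched and callers only ever receive defensive copies.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Permutations {

    // Standard "next permutation" step: mutates the array in place and
    // returns false once the last (descending) permutation has been reached.
    private static boolean nextPermutation(int[] a) {
        int i = a.length - 2;
        while (i >= 0 && a[i] >= a[i + 1]) i--;
        if (i < 0) return false;
        int j = a.length - 1;
        while (a[j] <= a[i]) j--;
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        for (int l = i + 1, r = a.length - 1; l < r; l++, r--) {
            tmp = a[l]; a[l] = a[r]; a[r] = tmp;
        }
        return true;
    }

    // Externally pure: every element of the result is an independent
    // snapshot of the mutable working array.
    public static List<int[]> lexicographicPermutations(int[] input) {
        int[] work = input.clone();
        Arrays.sort(work);
        List<int[]> result = new ArrayList<>();
        do {
            result.add(work.clone());
        } while (nextPermutation(work));
        return result;
    }
}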
This strategy of using OO / mutable state internally while keeping pure externally to the caller is used throughout the F# library notably with direct use of IEnumerators in the Seq module.
Another example may be found by comparing memoization using a mutable Dictionary implementation vs. an immutable Map implementation, which Don Syme explores here. The mutable Dictionary implementation is faster, but no less pure in usage than the Map implementation.
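As a rough illustration of that mutable-cache-behind-a-pure-interface idea (a generic sketch, not Don Syme's code, and shown in Java rather than F#):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class Memo {

    // The HashMap is mutable, but it is private to the returned closure,
    // so callers only observe a plain Function with pure behaviour.
    static <A, B> Function<A, B> memoize(Function<A, B> f) {
        Map<A, B> cache = new HashMap<>();
        return a -> cache.computeIfAbsent(a, f);
    }

    public static void main(String[] args) {
        Function<Integer, Integer> slowSquare = memoize(x -> {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
            return x * x;
        });
        System.out.println(slowSquare.apply(12)); // computed
        System.out.println(slowSquare.apply(12)); // served from the cache
    }
}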
In conclusion, I think using mutable OO in F# is most powerful for library designers seeking performance gains while keeping everything pure functional for library consumers.
I don't know any F#, but I can show you an example of the exact language mechanics you're looking for in Scala. Ignore it if this isn't helpful.
class Summation {
  def sum(aLow: Int, aHigh: Int) = {
    (aLow to aHigh).foldLeft(0) { (result, number) => result + number }
  }
}

object Sample {
  def main(args: Array[String]) {
    println(new Summation().sum(1, 10))
  }
}
I tried to keep it super simple. Notice that we're declaring a class to sum a range, but the implementation is written in a functional style. In this way, we can abstract away the paradigm we used to implement a piece of code.
I don't know about F#, but most software written in Scala is object-functional in nature.
The Scala compiler is probably the largest and most state-of-the-art example of object-functional software. Other notable examples include Akka, Lift, SBT, Kestrel, etc. (Googling will find you a lot more object-functional Scala examples.)

Suggested thresholds for some software metrics

I was searching the internet for some suggestions for thresholds for the following well-known software product metrics:
Lack of Cohesion in Methods (for the Henderson-Sellers variant of the metric)
Number of Inherited Methods in a Class
Number of Overridden Methods in a Class
Number of Newly Added Methods in a Class
However, I failed to find any. I am particularly interested in the first one. Does anybody know anything about this?
Thanks in advance, Martin
NDepend suggests the following:
http://www.ndepend.com/Metrics.aspx#LCOM
This reference gives values for LCOM and LCOMHS. It says:
LCOM = 1 - sum(MF) / (M * F)
LCOMHS = (M - sum(MF)/F) / (M - 1)
Where:
M is the number of methods in the class (both static and instance methods are counted; this also includes constructors, property getters/setters, and event add/remove methods).
F is the number of instance fields in the class.
MF is the number of methods of the class accessing a particular instance field.
Sum(MF) is the sum of MF over all instance fields of the class.
The underlying idea behind these formulas can be stated as follows: a class is utterly cohesive if all its methods use all its instance fields.
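As a quick sanity check of the two formulas, consider a hypothetical class with M = 4 methods and F = 2 instance fields, where one field is used by 3 methods and the other by 1, so sum(MF) = 4:

public class LcomExample {

    // LCOM = 1 - sum(MF) / (M * F)
    static double lcom(int m, int f, int sumMF) {
        return 1.0 - (double) sumMF / (m * f);
    }

    // LCOMHS = (M - sum(MF)/F) / (M - 1)
    static double lcomHS(int m, int f, int sumMF) {
        return (m - (double) sumMF / f) / (m - 1);
    }

    public static void main(String[] args) {
        System.out.println(lcom(4, 2, 4));   // 1 - 4/8         = 0.5
        System.out.println(lcomHS(4, 2, 4)); // (4 - 2)/(4 - 1) = 0.666...
    }
}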
I'm not sure how well this measure works when dealing with a Java Bean, which could well have a large number of getters and setters each dealing with a single property.
Be aware that there is a lot of variability in the numbers produced by various tools for the "same" metric. Sometimes this is because the original source was imprecise and sometimes it is because the tool maker "improved" the metric. Most metric tools have a default threshold. I'd use that unless you have a strong reason not to.
I do a lot of cohesion measurement for large classes. I don't think I have ever seen an LCOM-HS measurement above 1.0, but I think you may see them for tiny classes (where you probably don't really care that much about cohesiveness). Personally, I use a threshold of 0.8, but that's arbitrary. I've read a lot of papers about cohesion, and I have seen very few thresholds mentioned. This includes the Henderson-Sellers papers that I've read.
djna is correct when he says that cohesion measures will give poor scores for JavaBeans and other "data storage" classes. Furthermore, many cohesion measurements, including LCOM-HS do not consider some things that may lead to misleadingly poor scores. For example, many implementations don't consider relationships with inherited members. LCOM-HS and many others also have an over-reliance on how methods access fields. For example, if you write a class where the methods mainly interact with "data" through their arguments, you will get what appears to be a highly non-cohesive class; whereas in reality, it may be well-designed.
In terms of the other metrics you mentioned, I've seen no recommendations. I've looked around, and the only recommendation I've seen pertaining to the number of XXX methods is a maximum of 20 per class (no detail as to instance vs. static, overridden, etc.).
Here is a list of some papers dealing with OO metrics, mostly cohesion.