Implement feedback in Lucene

Can someone give me hints on implementing pseudo-relevance feedback in Lucene? I cannot find much help on Google. I am using the Similarity classes.
Is there any class in Lucene which I can extend to implement feedback?
Thanks.

Assuming you are referring to this relevance feedback method: once you have the TopDocs for your original query, iterate through however many records you desire (let's say we want the top 25 terms from the top 25 docs of the original query), and call IndexReader.getTermVectors(int), which will grab the information you need. Iterating through each result while storing the term frequencies in a hash map is the implementation that immediately occurs to me.
Something like:
//Get the original results
TopDocs docs = indexsearcher.search(query, 25);
HashMap<String,ScorePair> map = new HashMap<String,ScorePair>();
for (int i = 0; i < docs.scoreDocs.length; i++) {
    //Iterate the fields of each result's term vectors
    Fields fields = indexreader.getTermVectors(docs.scoreDocs[i].doc);
    for (String fieldname : fields) {
        //For each field, iterate its terms
        TermsEnum terms = fields.terms(fieldname).iterator(null);
        while (terms.next() != null) {
            //and store each one
            putTermInMap(fieldname, terms.term().utf8ToString(), terms.docFreq(), map);
        }
    }
}
//Sort the collected terms by descending score (see ScorePair below)
List<ScorePair> byScore = new ArrayList<ScorePair>(map.values());
Collections.sort(byScore);
BooleanQuery bq = new BooleanQuery();
//Perhaps we want to give the original query a bit of a boost
query.setBoost(5);
bq.add(query, BooleanClause.Occur.SHOULD);
for (int i = 0; i < 25; i++) {
    //Add the top feedback terms to the final query
    ScorePair pair = byScore.get(i);
    bq.add(new TermQuery(new Term(pair.field, pair.term)), BooleanClause.Occur.SHOULD);
}
//Say we want to score terms by tf/idf
void putTermInMap(String field, String term, int freq, Map<String,ScorePair> map) {
    String key = field + ":" + term;
    if (map.containsKey(key))
        map.get(key).increment();
    else
        map.put(key, new ScorePair(freq, field, term));
}

private class ScorePair implements Comparable<ScorePair> {
    int count = 0;
    double idf;
    String field;
    String term;

    ScorePair(int docfreq, String field, String term) {
        count++;
        //Standard Lucene idf calculation, squared; computed once per field:term.
        //(Note: ^ is XOR in Java, so the square must be written with Math.pow.)
        idf = Math.pow(1 + Math.log(indexreader.numDocs() / ((double) docfreq + 1)), 2);
        this.field = field;
        this.term = term;
    }

    void increment() { count++; }

    //Standard Lucene tf/idf calculation, if I'm not mistaken about it.
    double score() {
        return Math.sqrt(count) * idf;
    }

    //Sort descending by score so the best terms come first
    public int compareTo(ScorePair pair) {
        return Double.compare(pair.score(), this.score());
    }
}
(I make no claim that this is functional code in its current state.)
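With the expanded query assembled, the feedback pass is just a second search with the same searcher (sketched here with the same variable names as above):
//Re-run the search with the feedback-expanded query
TopDocs feedbackDocs = indexsearcher.search(bq, 25);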

Using Lucene's highlighting, I am getting too much highlighted; is there a workaround for this?

I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch.
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, where stripping out punctuation is inappropriate: so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of strings of words or digits, or individual characters which don't match the previous pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
    private String synlist = "";
    private boolean useSynonyms;

    public MyAnalyzer(String synlist) {
        if (!synlist.isEmpty()) {
            this.synlist = synlist;
            this.useSynonyms = true;
        }
    }

    public MyAnalyzer() {
        this.useSynonyms = false;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        if (useSynonyms) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
    private final CharTermAttribute termAttr;
    private final PositionIncrementAttribute posIncAttr;
    private final ArrayList<String> termStack;
    private AttributeSource.State current;
    private final TypeAttribute typeAttr;

    public TechTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        termStack = new ArrayList<>();
        termAttr = addAttribute(CharTermAttribute.class);
        posIncAttr = addAttribute(PositionIncrementAttribute.class);
        typeAttr = addAttribute(TypeAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (this.termStack.isEmpty() && input.incrementToken()) {
            final String currentTerm = termAttr.toString();
            final int bufferLen = termAttr.length();
            if (bufferLen > 0) {
                if (termStack.isEmpty()) {
                    termStack.addAll(Arrays.asList(techTokens(currentTerm)));
                    current = captureState();
                }
            }
        }
        if (!this.termStack.isEmpty()) {
            String part = termStack.remove(0);
            restoreState(current);
            termAttr.setEmpty().append(part);
            posIncAttr.setPositionIncrement(1);
            return true;
        } else {
            return false;
        }
    }

    public static String[] techTokens(String t) {
        List<String> tokenlist = new ArrayList<String>();
        String[] tokens;
        StringBuilder next = new StringBuilder();
        String token;
        char minus = '-';
        char underscore = '_';
        char c, prec, subc;
        for (int i = 0; i < t.length(); i++) {
            prec = i > 0 ? t.charAt(i - 1) : 0;
            c = t.charAt(i);
            subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
            if (Character.isLetterOrDigit(c) || c == underscore) {
                next.append(c);
            } else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
                next.append(c);
            } else {
                if (next.length() > 0) {
                    token = next.toString();
                    tokenlist.add(token);
                    next.setLength(0);
                }
                if (Character.isWhitespace(c)) {
                    // shouldn't be possible because the input stream has been tokenized on
                    // whitespace
                } else {
                    tokenlist.add(String.valueOf(c));
                }
            }
        }
        if (next.length() > 0) {
            token = next.toString();
            tokenlist.add(token);
        }
        tokens = tokenlist.toArray(new String[0]);
        return tokens;
    }
}
Examining the index, I can see that it contains the separate terms I expect, including the synonym values. For example, the text at the end of the first line has produced the terms:
of
them
at, simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms:
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show the whole search method here, but will focus on the code which lists the matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
        Query query, int max, String synList) throws IOException {
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
    Analyzer analyzer;
    if (synList != null) {
        analyzer = new MyAnalyzer(synList);
    } else {
        analyzer = new MyAnalyzer();
    }
    // Collect all the docs
    TopDocs results = searcher.search(query, max);
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = Math.toIntExact(results.totalHits.value);
    System.out.println("\nQuery: " + query.toString());
    System.out.println("Matches: " + numTotalHits);
    // Collect matching terms
    HashSet<String> matchedWords = new HashSet<String>();
    int start = 0;
    int end = Math.min(numTotalHits, max);
    for (int i = start; i < end; i++) {
        int id = hits[i].doc;
        float score = hits[i].score;
        Document doc = searcher.doc(id);
        String docpath = doc.get("path");
        String doctext = doc.get("text");
        try {
            TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
            TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    String match = frag[j].toString();
                    addMatchedWord(matchedWords, match);
                }
            }
        } catch (InvalidTokenOffsetsException e) {
            System.err.println(e.getMessage());
        }
        System.out.println("matched file: " + docpath);
    }
    if (matchedWords.size() > 0) {
        System.out.println("matched terms:");
        for (String word : matchedWords) {
            System.out.println(word);
        }
    }
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to pass the list of matches through yet another filter to produce the correct results.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work, but once I had it right, all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
    private String synlist = "";
    private boolean useSynonyms;

    public MyAnalyzer(String synlist) {
        if (!synlist.isEmpty()) {
            this.synlist = synlist;
            this.useSynonyms = true;
        }
    }

    public MyAnalyzer() {
        this.useSynonyms = false;
    }

    private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
        TokenStream result = new LowerCaseFilter(src);
        if (useSynonyms) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }
}
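To see why this pattern produces pinpoint matches, here is a quick standalone check (my own sketch, not part of the original test case) showing how the regex splits the problem strings from above into word and punctuation tokens, which is exactly how they are stored in the index:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    // The same pattern used in MyAnalyzer; with group 0, each whole match becomes a token
    private static final Pattern TOKEN = Pattern.compile("(([\\w]+-)*[\\w]+)|[^\\w\\s]");

    public static void main(String[] args) {
        for (String input : new String[] {"event);", "job.queue"}) {
            Matcher m = TOKEN.matcher(input);
            StringBuilder tokens = new StringBuilder();
            while (m.find()) {
                tokens.append('[').append(m.group()).append(']');
            }
            // Prints: event); -> [event][)][;]   and   job.queue -> [job][.][queue]
            System.out.println(input + " -> " + tokens);
        }
    }
}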

Make both sorting algorithms put the values in descending order, then create a driver class to test the two algorithms

I am extremely confused about how to reverse these sorting methods. Any help would be appreciated.
I have searched and tried researching, but I could not find anything that deals with this kind of Comparable list.
public class Sorting
{
    public static void selectionSort(Comparable[] list)
    {
        int min;
        Comparable temp;
        for (int index = 0; index < list.length-1; index++)
        {
            min = index;
            for (int scan = index+1; scan < list.length; scan++)
                if (list[scan].compareTo(list[min]) < 0)
                    min = scan;
            temp = list[min];
            list[min] = list[index];
            list[index] = temp;
        }
    }

    public static void insertionSort(Comparable[] list)
    {
        for (int index = 1; index < list.length; index++)
        {
            Comparable key = list[index];
            int position = index;
            while (position > 0 && key.compareTo(list[position-1]) < 0)
            {
                list[position] = list[position-1];
                position--;
            }
            list[position] = key;
        }
    }
}
I think if you change:
if (list[scan].compareTo(list[min]) < 0)
to
if (list[scan].compareTo(list[min]) > 0)
it will sort in reverse order.
Here is the API:
int compareTo(T o)
Compares this object with the specified object for order. Returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object.
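Since the question asks for both algorithms plus a driver, the analogous change applies to insertionSort as well, and a minimal driver class (my own sketch; names are illustrative) could look like this:
// In insertionSort, flip the comparison the same way:
// while (position > 0 && key.compareTo(list[position-1]) > 0)

public class SortDriver {
    public static void main(String[] args) {
        Integer[] bySelection = {3, 1, 4, 1, 5, 9, 2, 6};
        Integer[] byInsertion = bySelection.clone();
        Sorting.selectionSort(bySelection);  // with the > 0 comparison
        Sorting.insertionSort(byInsertion);  // with the > 0 comparison
        // Both should print [9, 6, 5, 4, 3, 2, 1, 1]
        System.out.println(java.util.Arrays.toString(bySelection));
        System.out.println(java.util.Arrays.toString(byInsertion));
    }
}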

Lucene TFIDF does not return 1 for a query that exactly matches a document

I implemented a program to rank documents by their TFIDF similarity scores, given a user input.
Here is the program:
public class Ranking {
    private static int maxHits = 10;
    private static Connection connect = null;
    private static PreparedStatement preparedStatement = null;
    private static ResultSet resultSet = null;

    public static void main(String[] args) throws Exception {
        System.out.println("Enter your paper title: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String paperTitle = br.readLine();
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager.getConnection("jdbc:mysql://localhost/arnetminer?"
                + "user=root&password=1234");
        // Bind the title as a parameter rather than concatenating it into the SQL
        preparedStatement = connect.prepareStatement
                ("SELECT stoppedstemmedtitle from arnetminer.new_bigdataset where title=?");
        preparedStatement.setString(1, paperTitle);
        resultSet = preparedStatement.executeQuery();
        resultSet.next();
        String stoppedstemmedtitle = resultSet.getString(1);
        String querystr = args.length > 0 ? args[0] : stoppedstemmedtitle;
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Query q = new QueryParser(Version.LUCENE_42, "stoppedstemmedtitle", analyzer).parse(querystr);
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("E:/Lucene/new_bigdataset_index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        VSMSimilarity vsmSimiliarty = new VSMSimilarity();
        searcher.setSimilarity(vsmSimiliarty);
        TopDocs hits = searcher.search(q, maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;
        PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");
        int counter = 0;
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            System.out.println(scoreDocs[n]);
            float score = sd.score;
            int docId = sd.doc;
            Document d = searcher.doc(docId);
            String fileName = d.get("title");
            String year = d.get("pub_year");
            String paperkey = d.get("paperkey");
            System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            ++counter;
        }
        writer.close();
    }
}
And
public class VSMSimilarity extends DefaultSimilarity {
    // Weighting codes
    public boolean doBasic = true;      // Basic tf-idf
    public boolean doSublinear = false; // Sublinear tf-idf
    public boolean doBoolean = false;   // Boolean
    // Scoring codes
    public boolean doCosine = true;
    public boolean doOverlap = false;

    // term frequency in document = measure of how often a term appears in the document
    public float tf(int freq) {
        return super.tf(freq);
    }

    // inverse document frequency = measure of how often the term appears across the index
    public float idf(int docFreq, int numDocs) {
        // The default behaviour of Lucene is 1 + log(numDocs/(docFreq+1)), which is what we want (default VSM model)
        return super.idf(docFreq, numDocs);
    }

    // normalization factor so that queries can be compared
    public float queryNorm(float sumOfSquaredWeights) {
        return super.queryNorm(sumOfSquaredWeights);
    }

    // number of terms in the query that were found in the document
    public float coord(int overlap, int maxOverlap) {
        return super.coord(overlap, maxOverlap);
    }

    // Note: this happens at index time, which we don't take advantage of (too many indices!)
    public float computeNorm(String fieldName, FieldInvertState state) {
        return super.computeNorm(state);
    }
}
However, it does not return a score of 1 for the document that has 100% similarity with the input.
If I enter the user input: Logic Based Knowledge Representation
the output and TFIDF scores I get are (5.165 for the document that has 100% similarity with the input):
3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663
Is this a normal thing or is there something wrong with my tfidf implementation?
Thank you very much!
First of all, Lucene already has a TF-IDF similarity: org.apache.lucene.search.similarities.TFIDFSimilarity
Second:
tf–idf, short for term frequency–inverse document frequency, is a
numerical statistic that is intended to reflect how important a word
is to a document in a collection or corpus
I've emphasized the word "word": tf-idf by itself applies to a single word, but when the query has multiple words, scoring is done like this:
One of the simplest ranking functions is computed by summing the
tf–idf for each query term
So this is the reason why tf-idf can return a score greater than 1.
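Also note that Lucene's raw scores are not normalized to [0, 1]. If you need the best hit to report 1.0, one common workaround (a sketch of mine, not something this answer prescribes) is to rescale by the top score, reusing the searcher and query from the question:
// Rescale raw Lucene scores so the best-matching document reports 1.0.
// This is only a relative measure, not a true cosine similarity.
TopDocs hits = searcher.search(q, maxHits);
if (hits.scoreDocs.length > 0) {
    float topScore = hits.scoreDocs[0].score;
    for (ScoreDoc sd : hits.scoreDocs) {
        float normalized = sd.score / topScore; // 1.0 for the top hit
        System.out.printf("%d,%4.3f%n", sd.doc, normalized);
    }
}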

How to multiply several fields in a tuple by a given field of the tuple

For each row of data, I would like to multiply fields 1 through N by field 0. The data could have hundreds of fields per row (or a variable number of fields, for that matter), so writing out each pair is not feasible. Is there a way to specify a range of fields, sort of like the following (incorrect) snippet?
A = LOAD 'foo.csv' USING PigStorage(',');
B = FOREACH A GENERATE $0*($1,..);
A UDF could come in handy here.
Implement exec(Tuple input) and iterate over all fields of the tuple as follows (not tested):
public class MultiplyField extends EvalFunc<Long> {
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // Multiply all of the tuple's fields together
            Long retVal = 1L;
            for (int i = 0; i < input.size(); i++) {
                Long j = (Long) input.get(i);
                retVal *= j;
            }
            return retVal;
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Then register your UDF and call it from your FOREACH.
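Strictly speaking, the question asks for fields 1 through N each multiplied by field 0, rather than one big product. A variant sketch under that reading (class name is mine, also untested) returns a tuple of the scaled values:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ScaleByFirstField extends EvalFunc<Tuple> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        // Field 0 is the multiplier; fields 1..N are each scaled by it
        Long factor = (Long) input.get(0);
        Tuple out = tupleFactory.newTuple(input.size() - 1);
        for (int i = 1; i < input.size(); i++) {
            out.set(i - 1, factor * (Long) input.get(i));
        }
        return out;
    }
}
You would then invoke it as something like B = FOREACH A GENERATE FLATTEN(ScaleByFirstField(*)); after REGISTERing the jar.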

In Lucene, why do my boosted and unboosted documents get the same score?

At index time I am boosting certain documents in this way:
if (myCondition)
{
    document.SetBoost(1.2f);
}
But at search time, documents that are otherwise identical, where some pass and some fail myCondition, all end up with the same score.
And here is the search code:
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.Add(new TermQuery(new Term(FieldNames.HAS_PHOTO, "y")), BooleanClause.Occur.MUST);
booleanQuery.Add(new TermQuery(new Term(FieldNames.AUTHOR_TYPE, AuthorTypes.BLOGGER)), BooleanClause.Occur.MUST_NOT);
indexSearcher.Search(booleanQuery, 10);
Can you tell me what I need to do to get the documents that were boosted to get a higher score?
Many Thanks!
Lucene encodes boosts on a single byte (although a float is generally encoded on four bytes) using the SmallFloat#floatToByte315 method. As a consequence, there can be a big loss of precision when converting the byte back to a float.
In your case, SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(1.2f)) returns 1f because 1f and 1.2f are too close to each other. Try using a bigger boost so that your documents get different scores. (For example, with 1.25, SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(1.25f)) gives 1.25f.)
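To see the quantization for yourself, here is a quick check (my own sketch, assuming a Lucene version that still ships org.apache.lucene.util.SmallFloat with these helpers) that round-trips a few boosts through the one-byte encoding:
import org.apache.lucene.util.SmallFloat;

public class BoostPrecisionCheck {
    public static void main(String[] args) {
        for (float boost : new float[] {1.0f, 1.1f, 1.2f, 1.25f, 1.5f}) {
            float roundTripped = SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(boost));
            // 1.0f, 1.1f and 1.2f all come back as 1.0f; 1.25f and 1.5f survive intact
            System.out.println(boost + " -> " + roundTripped);
        }
    }
}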
Here is the requested test program that was too long to post in a comment.
class Program
{
    static void Main(string[] args)
    {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer());
        const string FIELD = "name";
        for (int i = 0; i < 10; i++)
        {
            StringBuilder notes = new StringBuilder();
            notes.AppendLine("This is a note 123 - " + i);
            string text = notes.ToString();
            Document doc = new Document();
            var field = new Field(FIELD, text, Field.Store.YES, Field.Index.NOT_ANALYZED);
            if (i % 2 == 0)
            {
                field.SetBoost(1.5f);
                doc.SetBoost(1.5f);
            }
            else
            {
                field.SetBoost(0.1f);
                doc.SetBoost(0.1f);
            }
            doc.Add(field);
            writer.AddDocument(doc);
        }
        writer.Commit();
        //string TERM = QueryParser.Escape("*+*");
        string TERM = "T";
        IndexSearcher searcher = new IndexSearcher(dir);
        Query query = new PrefixQuery(new Term(FIELD, TERM));
        var hits = searcher.Search(query);
        int count = hits.Length();
        Console.WriteLine("Hits - {0}", count);
        for (int i = 0; i < count; i++)
        {
            var doc = hits.Doc(i);
            Console.WriteLine(doc.ToString());
            // Explain() takes a document id, not the hit index
            var explain = searcher.Explain(query, hits.Id(i));
            Console.WriteLine(explain.ToString());
        }
    }
}