In addition to standard term search with tf-idf similarity over a text content field, I would like scoring based on the "similarity" of numeric fields. This similarity would depend on the distance between the value in the query and the value in the document (e.g. Gaussian with mean = [user input], sigma = 0.5).
I.e., let's say documents represent people, and a person document has two fields:
description (full text)
age (numeric).
I want to find documents like
description:(x y z) age:30
but with age acting not as a filter, but as part of the score (for a person of age 30 the multiplier would be 1.0, for a 25-year-old 0.8, etc.).
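In formula form (with m being the user-supplied mean and s the sigma):
multiplier(age) = exp(-(age - m)^2 / (2 * s^2))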
Can this be achieved in a sensible manner?
EDIT: Finally I found out this can be done by wrapping ValueSourceQuery and TermQuery with CustomScoreQuery. See my solution below.
EDIT 2: With fast-changing versions of Lucene, I just want to add that it was tested on Lucene 3.0 (Java).
Okay, so here's a (somewhat verbose) proof of concept as a full JUnit test. I haven't tested its efficiency on a large index yet, but from what I've read, after a warm-up it should perform well, provided there's enough RAM available to cache the numeric fields.
package tests;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.IntFieldSource;
import org.apache.lucene.search.function.ValueSourceQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import junit.framework.TestCase;
public class AgeAndContentScoreQueryTest extends TestCase
{
public class AgeAndContentScoreQuery extends CustomScoreQuery
{
protected float peakX;
protected float sigma;
public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
super(subQuery, valSrcQuery);
this.setStrict(true); // do not normalize score values from ValueSourceQuery!
this.peakX = peakX; // age for which the age-relevance is best
this.sigma = sigma;
}
@Override
public float customScore(int doc, float subQueryScore, float valSrcScore){
// subQueryScore is the tf-idf score from the content query
float contentScore = subQueryScore;
// valSrcScore is a value of date-of-birth field, represented as a float
// let's convert age value to gaussian-like age relevance score
float x = (2011 - valSrcScore); // age (note: the current year, 2011, is hardcoded here)
float ageScore = (float) Math.exp(-Math.pow(x - peakX, 2) / (2 * sigma * sigma)); // parentheses around 2*sigma*sigma matter; without them the expression multiplies instead of divides
float finalScore = ageScore * contentScore;
System.out.println("#contentScore: " + contentScore);
System.out.println("#ageValue: " + (int)valSrcScore);
System.out.println("#ageScore: " + ageScore);
System.out.println("#finalScore: " + finalScore);
System.out.println("+++++++++++++++++");
return finalScore;
}
}
protected Directory directory;
protected Analyzer analyzer = new WhitespaceAnalyzer();
protected String fieldNameContent = "content";
protected String fieldNameDOB = "dob";
protected void setUp() throws Exception
{
directory = new RAMDirectory();
analyzer = new WhitespaceAnalyzer();
// indexed documents
String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
int[] dobs = {1991, 1981, 1987}; // date of birth
IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < contents.length; i++)
{
Document doc = new Document();
doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i])); // store & index
writer.addDocument(doc);
}
writer.close();
}
public void testSearch() throws Exception
{
String inputTextQuery = "foo bar";
float peak = 27.0f;
float sigma = 0.1f;
QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
Query contentQuery = parser.parse(inputTextQuery);
ValueSourceQuery dobQuery = new ValueSourceQuery( new IntFieldSource(fieldNameDOB) );
// or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB,Type.INT);
CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);
IndexSearcher searcher = new IndexSearcher(directory);
TopDocs docs = searcher.search(finalQuery, 10);
System.out.println("\nDocuments found:\n");
for(ScoreDoc match : docs.scoreDocs)
{
Document d = searcher.doc(match.doc);
System.out.println("CONTENT: " + d.get(fieldNameContent) );
System.out.println("D.O.B.: " + d.get(fieldNameDOB) );
System.out.println("SCORE: " + match.score );
System.out.println("-----------------");
}
}
}
This can be achieved using Solr's FunctionQuery.
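For example, an illustrative (untested) sketch, where the field name and constants are placeholders: the {!boost} parser multiplies the text relevance by a function of the age distance, so age contributes to the score instead of filtering:
q={!boost b=recip(abs(sub(age,30)),1,10,10)}description:(x y z)
Here recip(x,m,a,b) computes a/(m*x+b), so a document with age exactly 30 gets multiplier 1.0 and the multiplier falls off as the distance grows (a reciprocal rather than Gaussian shape).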
I need to highlight a set of words inside an existing PDF, given specific coordinates that I have already extracted.
I am working with PDFBox by Apache (latest version, 2.0.8).
There is an example file I could use for this purpose (AddAnnotations.java on the PDFBox website), but I think this example was compiled with an older Java version, as the following import does not work:
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationHighlight;
Can anyone help me with that? Which is the simplest way to highlight words by using this library?
Here is the code to highlight ALL the words inside a PDF document. Highlighting only a specific set of words can easily be done by modifying this script (a sketch for this follows the code). Please note this is only a test, and further checks are needed for words that terminate on a new line, as well as for words placed in negative landscape/portrait PDF pages. Optimizing this script is also possible.
This script was built using Apache PDFBox 2.0.8.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup;
public class TestAnnotatePDF extends PDFTextStripper
{
static List<double[]> coordinates;
static List<String> tokenStream;
public TestAnnotatePDF() throws IOException
{
// data structure containing coordinate information for each token
coordinates = new ArrayList<>();
// list of words extracted from the text (assuming whitespace-based tokenization)
tokenStream = new ArrayList<>();
}
public static void main(String [] args) throws IOException
{
try
{
//Loading an existing document
File file = new File("MyDocument");
PDDocument document = PDDocument.load(file);
//extended PDFTextStripper class
PDFTextStripper stripper = new TestAnnotatePDF();
//Get number of pages
int number_of_pages = document.getDocumentCatalog().getPages().getCount();
//The method writeText will invoke an override version of writeString
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
//Print collected information
System.out.println(tokenStream);
System.out.println(tokenStream.size());
System.out.println(coordinates.size());
double page_height;
double page_width;
double width, height, minx, maxx, miny, maxy;
int rotation;
//scan each page and highlight all the words inside it
for (int page_index = 0; page_index < number_of_pages; page_index++)
{
//get current page
PDPage page = document.getPage(page_index);
//Get annotations for the selected page
List<PDAnnotation> annotations = page.getAnnotations();
//Define a color to use for highlighting text
PDColor red = new PDColor(new float[] { 1, 0, 0 }, PDDeviceRGB.INSTANCE);
//Page height and width
page_height = page.getMediaBox().getHeight();
page_width = page.getMediaBox().getWidth();
//Scan collected coordinates
for (int i=0; i<coordinates.size(); i++)
{
//if the current coordinates are not related to the current
//page, ignore them
if ((int) coordinates.get(i)[4] != (page_index+1))
continue;
else
{
//get rotation of the page...portrait..landscape..
rotation = (int) coordinates.get(i)[7];
//page rotated of 90degrees
if (rotation == 90)
{
height = coordinates.get(i)[5];
width = coordinates.get(i)[6];
width = (page_height * width)/page_width;
//define coordinates of a rectangle
maxx = coordinates.get(i)[1];
minx = coordinates.get(i)[1] - height;
miny = coordinates.get(i)[0];
maxy = coordinates.get(i)[0] + width;
}
else //i should add here the cases -90/-180 degrees
{
height = coordinates.get(i)[5];
minx = coordinates.get(i)[0];
maxx = coordinates.get(i)[2];
miny = page_height - coordinates.get(i)[1];
maxy = page_height - coordinates.get(i)[3] + height;
}
//Add an annotation for each scanned word
PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
txtMark.setColor(red);
txtMark.setConstantOpacity((float)0.3); // 30% transparent
PDRectangle position = new PDRectangle();
position.setLowerLeftX((float) minx);
position.setLowerLeftY((float) miny);
position.setUpperRightX((float) maxx);
position.setUpperRightY((float) (maxy + height));
txtMark.setRectangle(position);
float[] quads = new float[8];
quads[0] = position.getLowerLeftX(); // x1
quads[1] = position.getUpperRightY()-2; // y1
quads[2] = position.getUpperRightX(); // x2
quads[3] = quads[1]; // y2
quads[4] = quads[0]; // x3
quads[5] = position.getLowerLeftY()-2; // y3
quads[6] = quads[2]; // x4
quads[7] = quads[5]; // y4
txtMark.setQuadPoints(quads);
txtMark.setContents(tokenStream.get(i).toString());
annotations.add(txtMark);
}
}
}
//Saving the document in a new file
File highlighted_doc = new File("MyDocument_final.pdf");
document.save(highlighted_doc);
document.close();
}
catch(IOException e)
{
System.out.println(e);
}
}
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException
{
String token = "";
int token_length = textPositions.size();
int counter = 1;
double minx = 0,maxx = 0,miny = 0,maxy =0;
double height = 0;
double width = 0;
int rotation = 0;
for (TextPosition text : textPositions)
{
rotation = text.getRotation();
if (text.getHeight() > height)
height = text.getHeight();
if (text.getWidth() > width)
width = text.getWidth();
//if it is the first char of the current word
if (counter == 1)
{
minx = text.getX();
miny = text.getY();
}
//if it is the last char of the current word
if (counter == token_length)
{
maxx = text.getEndX();
maxy = text.getY();
}
token += text.getUnicode(); // TextPosition.toString() also works, but getUnicode() is explicit
counter += 1;
}
tokenStream.add(token);
double word_coordinates [] = {minx,miny,maxx,maxy,this.getCurrentPageNo(), height, width, rotation};
coordinates.add(word_coordinates);
}
}
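To restrict the highlighting to a specific set of words, as mentioned above, a minimal sketch (the whitelist below is a hypothetical example) is to skip non-matching tokens inside the coordinate-scanning loop, right before the annotation is built:
// inside the loop over coordinates, before creating the PDAnnotationTextMarkup;
// requires: import java.util.Arrays; import java.util.HashSet; import java.util.Set;
Set<String> targets = new HashSet<>(Arrays.asList("foo", "bar")); // hypothetical word whitelist
if (!targets.contains(tokenStream.get(i))) {
continue; // leave non-matching words un-highlighted
}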
Here is the code to highlight specific words inside a PDF document. Please note that it currently highlights the whole line containing the search text; highlighting only the specific words is still in progress... Any suggestion for highlighting specific words on top of this code would be highly appreciated.
This script was built using Apache PDFBox 2.0.8.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
public class PDFhighlightDemo extends PDFTextStripper {
public PDFhighlightDemo() throws IOException {
super();
}
public static void main(String[] args) throws IOException {
PDDocument document = null;
String fileName = "Demo1.pdf";
try {
document = PDDocument.load( new File(fileName) );
PDFTextStripper stripper = new PDFhighlightDemo();
stripper.setSortByPosition( true );
stripper.setStartPage( 0 );
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
File file1 = new File("FinalPDF.pdf");
document.save(file1);
}
finally {
if( document != null ) {
document.close();
}
}
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*/
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
boolean isFound = false;
float posXInit1 = 0,
posXEnd1 = 0,
posYInit1 = 0,
posYEnd1 = 0,
width1 = 0,
height1 = 0,
fontHeight1 = 0;
String[] criteria = {"angular", "prepared"};
for (int i = 0; i < criteria.length; i++) {
if (string.contains(criteria[i])) {
isFound = true;
}
}
if (isFound) {
// every value below depends only on the first and last TextPosition,
// so there is no need to loop over all of them
TextPosition first = textPositions.get(0);
TextPosition last = textPositions.get(textPositions.size() - 1);
posXInit1 = first.getXDirAdj();
posXEnd1 = last.getXDirAdj() + last.getWidth();
posYInit1 = first.getPageHeight() - first.getYDirAdj();
posYEnd1 = first.getPageHeight() - last.getYDirAdj();
width1 = first.getWidthDirAdj();
height1 = first.getHeightDir();
float quadPoints[] = {posXInit1, posYEnd1 + height1 + 2, posXEnd1, posYEnd1 + height1 + 2, posXInit1, posYInit1 - 2, posXEnd1, posYEnd1 - 2};
List<PDAnnotation> annotations = document.getPage(this.getCurrentPageNo() - 1).getAnnotations();
PDAnnotationTextMarkup highlight = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
PDRectangle position = new PDRectangle();
position.setLowerLeftX(posXInit1);
position.setLowerLeftY(posYEnd1);
position.setUpperRightX(posXEnd1);
position.setUpperRightY(posYEnd1 + height1);
highlight.setRectangle(position);
// quadPoints is array of x,y coordinates in Z-like order (top-left, top-right, bottom-left,bottom-right)
// of the area to be highlighted
highlight.setQuadPoints(quadPoints);
PDColor yellow = new PDColor(new float[]{1, 1, 1 / 255F}, PDDeviceRGB.INSTANCE);
highlight.setColor(yellow);
annotations.add(highlight);
}
}
}
Highlight specific words in a document using PDFclown.
package com.NLP.demo;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.tools.TextExtractor;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
public class PDFCrownDemo {
public static void main(String[] args) throws IOException {
PDFCrownDemo demo = new PDFCrownDemo();
demo.highlighttext();
}
public void highlighttext() throws IOException{
org.pdfclown.files.File file = new org.pdfclown.files.File("src/main/resources/XXX.pdf");
String textRegEx = "Contract";
Pattern pattern = Pattern.compile(textRegEx, Pattern.CASE_INSENSITIVE);
TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
textExtractor.filter(textStrings,new TextExtractor.IIntervalFilter()
{
@Override
public boolean hasNext()
{return matcher.find();}
@Override
public Interval next()
{return new Interval(matcher.start(), matcher.end());}
@Override
public void process(Interval interval,ITextString match)
{
// Defining the highlight box of the text pattern match...
List<Quad> highlightQuads = new ArrayList<Quad>();
{
/*
NOTE: A text pattern match may be split across multiple contiguous lines,
so we have to define a distinct highlight box for each text chunk.
*/
Rectangle2D textBox = null;
for(TextChar textChar : match.getTextChars())
{
Rectangle2D textCharBox = textChar.getBox();
if(textBox == null)
{textBox = (Rectangle2D)textCharBox.clone();}
else
{
if(textCharBox.getY() > textBox.getMaxY())
{
highlightQuads.add(Quad.get(textBox));
textBox = (Rectangle2D)textCharBox.clone();
}
else
{textBox.add(textCharBox);}
}
}
highlightQuads.add(Quad.get(textBox));
}
// Highlight the text pattern match!
new TextMarkup(page,MarkupTypeEnum.Highlight, highlightQuads);
}
@Override
public void remove()
{throw new UnsupportedOperationException();}
}
);
}
//file.save(SerializationModeEnum.Incremental);
file.save(new java.io.File("src/main/resources/XXX.pdf"), SerializationModeEnum.Standard);
}
}
Is it possible to extract text from an area with PDFBox using just the binaries, instead of having to write my own code?
Compile and pack this simple program into a jar:
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
public class ExtractText {
// Usage: xxx.jar filepath page x y width height
public static void main(String[] args) throws IOException {
if (args.length != 6) {
System.out.println("Help info");
return;
}
// Parameters
String filepath = args[0];
int page = Integer.parseInt(args[1]);
int x = Integer.parseInt(args[2]);
int y = Integer.parseInt(args[3]);
int width = Integer.parseInt(args[4]);
int height = Integer.parseInt(args[5]);
PDDocument document = PDDocument.load(new File(filepath));
PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
Rectangle2D rect = new java.awt.geom.Rectangle2D.Float(x, y, width, height);
textStripper.addRegion("region", rect);
PDPage docPage = document.getPage(page);
textStripper.extractRegions(docPage);
String textForRegion = textStripper.getTextForRegion("region");
System.out.println(textForRegion);
document.close();
}
}
Run it from the command line, e.g.:
java -jar xxx.jar filepathToPdf pageToExtract x y width height
Add validation code for parameters and some usage info.
Edit
Also add the PDFBox libraries to the classpath. Note that java ignores -cp when -jar is used, so name the main class explicitly:
java -cp "...;xxx.jar" ExtractText filepathToPdf pageToExtract x y width height
I need to write code that does the following:
a) Goes to www.google.com and searches for 'calculator'. In the results you will see a calculator appear in the browser itself.
b) Reads Num1 from an xls file using Java code
c) Clicks Num1 in the Google calculator
The Excel file is simple and looks like this:
Num1
7
95
I hope that is clear:
First row: Num1 (the header)
Second row: 7
Third row: 95
Using @Test I have written the code below.
The problem is, I can click Num1 when it is 7, but not when it is 95.
Moreover, if my Num1 is 2, it also throws an error. Please help.
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
public class Exercise3 {
String part1 = "//*[#id='cwbt";
String part2 = "']/div/span";
@Test(dataProvider="getData")
public void calculator(String num1, String num2, String operation, String expectedResult){
WebDriver d = new FirefoxDriver();
d.manage().window().maximize();
d.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
d.get("https://www.google.co.in/?gfe_rd=cr&ei=2ArXVu7RGafG8AfjsKPgDQ&gws_rd=ssl#q=calculator");
//calculator box
//*[@id='cwbt13']/div/span
//*[@id='cwbt23']/div/span
//*[@id='cwbt33']/div/span
//*[@id='cwbt43']/div/span
WebElement box = d.findElement(By.xpath("//*[@id='cwmcwd']"));
for(int i=13;i<=46;i++){
String num = d.findElement(By.xpath("//*[@id='cwbt"+i+"']/div/span")).getText();
if(num.equals(num1)){
d.findElement(By.xpath("//*[@id='cwbt"+i+"']/div/span")).click();
break;
}
}
}
@DataProvider
public Object[][] getData(){
Xls_Reader xls = new Xls_Reader("E:\\Pessoal\\QTPSelenium\\Excel\\Calculator.xlsx");
int rows = xls.getRowCount("Addition");
int cols = xls.getColumnCount("Addition");
Object data[][] = new Object[rows-1][cols-2];
for(int rNum=2;rNum<=rows;rNum++){
for(int cNum=0;cNum<cols-2;cNum++){
System.out.println(xls.getCellData("Addition", cNum, rNum));
data[rNum-2][cNum] = xls.getCellData("Addition", cNum, rNum);
}
}
return data;
}
}
Try the following code; I have updated it according to the Java language.
FirefoxDriver driver;
driver = new FirefoxDriver();
driver.manage().timeouts().implicitlyWait(10,TimeUnit.SECONDS);
driver.get("https://www.google.co.in/?gfe_rd=cr&ei=2ArXVu7RGafG8AfjsKPgDQ&gws_rd=ssl#q=calculator");
// build JavaScript that writes the evaluated expression (e.g. 7 + 5) into the calculator's output element ('cwos')
String strJavaScript = "document.getElementById('cwos').textContent= " + num1.trim() + " " + operation.trim() + " " + num2.trim();
Object obj = driver.executeScript(strJavaScript);
driver.findElement(By.xpath(".//*[#id='cwbt45']/div/span")).click();
String answer = driver.findElement(By.id("cwos")).getText();
if (expectedAnswer == null ? answer == null : expectedAnswer.equals(answer)) {
System.out.println("Expected answer and given answer are same.");
}
else
System.out.println("Expected answer and given answer are not same.");
System.out.print(answer);
If you run into any issue, let me know.
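Alternatively, note that the original button-scanning loop can only click 7 because each calculator button holds a single digit; 95 needs two presses. A minimal sketch reusing the question's own XPath pattern (the cwbt13..cwbt46 button ids come from the question and may change on Google's side):
// press a multi-digit number one digit at a time
for (char c : num1.trim().toCharArray()) {
String digit = String.valueOf(c);
for (int i = 13; i <= 46; i++) {
WebElement btn = d.findElement(By.xpath("//*[@id='cwbt" + i + "']/div/span"));
if (btn.getText().equals(digit)) {
btn.click();
break;
}
}
}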
I implemented a program to rank documents based on their TF-IDF similarity scores given a user input.
Following is the program:
// imports needed to compile this snippet (Lucene 4.2)
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Ranking{
private static int maxHits = 10;
private static Connection connect = null;
private static PreparedStatement preparedStatement = null;
private static ResultSet resultSet = null;
public static void main(String[] args) throws Exception {
System.out.println("Enter your paper title: ");
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String paperTitle = null;
paperTitle = br.readLine();
Class.forName("com.mysql.jdbc.Driver");
connect = DriverManager.getConnection("jdbc:mysql://localhost/arnetminer?"
+ "user=root&password=1234");
// use a bind parameter instead of string concatenation (avoids quoting problems and SQL injection)
preparedStatement = connect.prepareStatement
("SELECT stoppedstemmedtitle from arnetminer.new_bigdataset "
+ "where title = ?");
preparedStatement.setString(1, paperTitle);
resultSet = preparedStatement.executeQuery();
resultSet.next();
String stoppedstemmedtitle = resultSet.getString(1);
String querystr = args.length > 0 ? args[0] :stoppedstemmedtitle;
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
Query q = new QueryParser(Version.LUCENE_42, "stoppedstemmedtitle", analyzer).parse(querystr);
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("E:/Lucene/new_bigdataset_index")));
IndexSearcher searcher = new IndexSearcher(reader);
VSMSimilarity vsmSimiliarty = new VSMSimilarity();
searcher.setSimilarity(vsmSimiliarty);
TopDocs hits = searcher.search(q, maxHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;
PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");
int counter = 0;
for (int n = 0; n < scoreDocs.length; ++n) {
ScoreDoc sd = scoreDocs[n];
System.out.println(scoreDocs[n]);
float score = sd.score;
int docId = sd.doc;
Document d = searcher.doc(docId);
String fileName = d.get("title");
String year = d.get("pub_year");
String paperkey = d.get("paperkey");
System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
++counter;
}
writer.close();
}
}
And
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;
public class VSMSimilarity extends DefaultSimilarity{
// Weighting codes
public boolean doBasic = true; // Basic tf-idf
public boolean doSublinear = false; // Sublinear tf-idf
public boolean doBoolean = false; // Boolean
//Scoring codes
public boolean doCosine = true;
public boolean doOverlap = false;
// term frequency in document = measure of how often a term appears in the document
public float tf(int freq) {
return super.tf(freq);
}
// inverse document frequency = measure of how often the term appears across the index
public float idf(int docFreq, int numDocs) {
// The default behaviour of Lucene is 1 + log (numDocs/(docFreq+1)), which is what we want (default VSM model)
return super.idf(docFreq, numDocs);
}
// normalization factor so that queries can be compared
public float queryNorm(float sumOfSquaredWeights){
return super.queryNorm(sumOfSquaredWeights);
}
// number of terms in the query that were found in the document
public float coord(int overlap, int maxOverlap) {
// else: can't get here
return super.coord(overlap, maxOverlap);
}
// Note: this happens at index time, which we don't take advantage of (too many indices!)
public float computeNorm(String fieldName, FieldInvertState state){
// else: can't get here
return super.computeNorm(state);
}
}
However, it does not return a score of 1 for the document that is 100% similar to the input.
If I put in the following user input: Logic Based Knowledge Representation
the output I got, with the TF-IDF scores, is (5.165 for the document that is 100% similar to the input):
3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663
Is this normal, or is there something wrong with my TF-IDF implementation?
Thank you very much!
First of all, Lucene already has a TF-IDF similarity: org.apache.lucene.search.similarities.TFIDFSimilarity
Secondly:
tf–idf, short for term frequency–inverse document frequency, is a
numerical statistic that is intended to reflect how important a word
is to a document in a collection or corpus
I've marked word: tf-idf by itself applies only to a one-word query, but when the query has multiple words, tf-idf is combined like this:
One of the simplest ranking functions is computed by summing the
tf–idf for each query term
So this is the reason why tf-idf can return a score greater than 1.
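If you need scores in a [0, 1] range, one common workaround (a sketch, not something your code already does) is to normalize by the score of the top hit:
TopDocs hits = searcher.search(q, maxHits);
float maxScore = hits.getMaxScore(); // score of the best match
for (ScoreDoc sd : hits.scoreDocs) {
float normalized = sd.score / maxScore; // 1.0 for the top document
}
Keep in mind this is only a relative normalization: the top document always gets 1.0, regardless of how similar it actually is.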
I'd like to index and search lower-cased keywords. I attached test code which IMHO clearly demonstrates my simple goal. I index two words, one with a capital letter, and then I search and print them back one by one. For this I created an Analyzer which just converts keywords to lower case (KeywordAnalyzer doesn't lower-case, and SimpleAnalyzer splits on non-letter characters). I use this analyzer for both the IndexWriter and the QueryParser. However, for some reason I can't get back the word with the capital letter, even if I search for the lower-cased word ("bye" in the example).
Program expected output:
hello
Bye
Actual output:
hello
What's the problem?
I hope you don't mind that the code is in Scala. I'll gladly help you understand it in case it's not clear what the code does.
import org.apache.lucene.store.FSDirectory
import java.io.{Reader, File}
import org.apache.lucene.index._
import org.apache.lucene.document._
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.analysis.util.CharTokenizer
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.util.Version
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
final class LcAnalyzer(lucVer: Version) extends Analyzer {
def createComponents(fieldName: String, reader: Reader) =
new TokenStreamComponents(new CharTokenizer(lucVer, reader) {
def isTokenChar(c: Int) = true
override def normalize(c: Int) = Character.toLowerCase(c)
})
}
object LuceneTest {
val LV = Version.LUCENE_43
val F = "myf"
val VALS = Seq("hello", "Bye")
val indexDir = FSDirectory.open(new File("testindex"))
val anlz = new LcAnalyzer(LV)
def main(args: Array[String]) {
writeData()
val reader = DirectoryReader.open(indexDir)
val searcher = new IndexSearcher(reader)
val p = new QueryParser(LV, F, anlz)
for (v <- VALS) {
val hits = searcher.search(p.parse(F + ':' + v), 1).scoreDocs
for (i <- 0 until hits.length) {
val doc = searcher.doc(hits(i).doc)
println(doc.get(F))
}
}
}
def writeData() {
val writer = {
val wc = new IndexWriterConfig(LV, anlz)
val writer = new IndexWriter(indexDir, wc)
writer.commit
writer
}
for (v <- VALS) {
val doc = new Document
doc.add(new StringField(F, v, Field.Store.YES))
writer.addDocument(doc)
}
writer.commit
writer.close
}
}