Searching sentences in PDF using Lucene phrase query and PDFBOX - lucene

I have used the following code for searching text in a PDF. It works fine for a single word, but for sentences, as in the code below, it reports the text as not present even when it is in the document. Can anyone help me resolve this?
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    // Directory directory = FSDirectory.open("/tmp/testindex");
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    PDDocument document = null;
    try {
        document = PDDocument.load(strFilepath);
    } catch (IOException ex) {
        System.out.println("Exception occurred while loading the document: " + ex);
    }
    String output = new PDFTextStripper().getText(document);
    doc.add(new Field("contents", output, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
    try {
        // Now search the index
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
        String sentence = "Following are the";
        PhraseQuery query = new PhraseQuery();
        for (String word : sentence.split(" ")) {
            query.add(new Term("contents", word));
        }
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        if (hits.length > 0) {
            System.out.println("Searched text exists in the PDF.");
        }
        ireader.close();
        directory.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }
}

You should use the query parser to create the query from your sentence instead of building the PhraseQuery yourself. Your hand-built query contains the term "Following", which was never indexed: StandardAnalyzer lowercases tokens during indexing, so only "following" is in the index. Parsing the quoted sentence through the same analyzer, e.g. parser.parse("\"Following are the\""), lowercases the query terms the same way and produces a matching phrase query.
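The case mismatch can be seen without Lucene on the classpath. This sketch stands in for StandardAnalyzer with a crude split-and-lowercase tokenizer (an approximation, not the real analyzer), and shows why looking up the raw token "Following" misses:

```java
import java.util.*;
import java.util.stream.*;

public class CaseDemo {
    // Rough stand-in for StandardAnalyzer: split on non-word characters and lowercase.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(t -> !t.isEmpty())
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The index only ever sees the analyzed (lowercased) tokens.
        Set<String> indexedTerms = new HashSet<>(analyze("Following are the steps"));
        // The hand-built PhraseQuery looks up the raw term "Following"...
        System.out.println(indexedTerms.contains("Following")); // prints false
        // ...but only the lowercased form exists in the index.
        System.out.println(indexedTerms.contains("following")); // prints true
    }
}
```

Running the query string through QueryParser with the same analyzer applies exactly this normalization to the query side, which is why it fixes the phrase search.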

Related

Lucene, relevance/scoring for an in-memory string

I am building a bot that monitors HN for topics that I am interested in.
I'd like to analyze an in-memory string, and determine if it contains some keywords that I am interested in.
I'd like it to take into consideration the things that Lucene does when performing a standard query (word stemming, stop words, normalizing punctuation, etc).
I could probably build an in-memory index, and query it using the normal approach, but is there a way that I can use the internals of Lucene to avoid a needless index being built?
Bonus points if I can get a relevance value (0.0-1.0), instead of just a true/false value.
Pseudo code:
public static decimal IsRelevant(string keywords, string input)
{
    // Does the "input" variable look like it contains "keywords"?
}
IsRelevant("books", "I just bought a book, and I like it."); // matching!
IsRelevant("book", "I just bought many books!"); // matching!
I created a solution using an in-memory search index. It's not ideal, but it does the task.
public static float RelevanceScore(string keyword, string input)
{
    var directory = new RAMDirectory();
    var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);
    using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
    {
        var doc = new Document();
        doc.Add(new Field("input", input, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Commit();
    }
    using (var reader = IndexReader.Open(directory))
    {
        var searcher = new IndexSearcher(reader);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, "input", analyzer);
        var query = parser.Parse(keyword);
        var result = searcher.Search(query, null, 10);
        if (result.ScoreDocs.Length == 0)
        {
            return 0;
        }
        var doc = result.ScoreDocs.Single();
        return doc.Score;
    }
}
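For the bonus question: raw Lucene scores are unbounded and not comparable across queries, so there is no built-in 0.0-1.0 relevance. One common (hypothetical, not Lucene-provided) convention is a monotone squash of the raw score into [0, 1), sketched here in Java; the same one-liner works in C#:

```java
public class ScoreNormalizer {
    // Lucene similarity scores are positive but unbounded; s / (s + 1) maps
    // them monotonically into [0, 1) while preserving the ranking order.
    static float normalize(float rawScore) {
        return rawScore / (rawScore + 1f);
    }
}
```

This keeps relative ordering intact, but the absolute values still depend on the query and corpus, so treat them as a heuristic rather than a calibrated probability.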

Query checking whether a document contains a word returns a low score

I created a Lucene index and want to find all documents that contain a certain word or phrase.
When I do that, I noticed that the longer the text containing the word, the lower the score.
How can I create a query that only checks for the existence of a word in my documents/fields?
This is how I created the index:
public static Directory CreateIndex(IEnumerable<WorkItemDto> workItems)
{
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    Directory index = new RAMDirectory();
    IndexWriter writer = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    foreach (WorkItemDto workItemDto in workItems)
    {
        Document doc = new Document();
        doc.Add(new Field("Title", workItemDto.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        //doc.Add(new NumericField("ID", Field.Store.YES, true).SetIntValue(workItemDto.Id));
        writer.AddDocument(doc);
    }
    writer.Dispose();
    return index;
}
And this is how I created the query:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Query query = new QueryParser(Version.LUCENE_30, "Title", analyzer).Parse("Some");
IndexSearcher searcher = new IndexSearcher(indexDir);
TopDocs docs = searcher.Search(query, 10);
ScoreDoc[] hits = docs.ScoreDocs;
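The drop-off described above comes from length normalization: in the classic Lucene 3.x DefaultSimilarity, a field's norm is 1/sqrt(numTerms), so the same term match is scaled down in longer fields. A minimal arithmetic sketch of that factor (plain Java, no Lucene dependency):

```java
public class LengthNormDemo {
    // Classic Lucene 3.x DefaultSimilarity length normalization:
    // shorter fields get a larger norm, so the same term scores higher in them.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(4));   // short title: norm 0.5
        System.out.println(lengthNorm(100)); // long title: norm 0.1, hence a lower score
    }
}
```

To get a pure existence check, the usual routes are wrapping the query so every hit scores the same (e.g. a constant-score query) or plugging in a custom Similarity whose length norm always returns 1; which of these is available depends on the Lucene.NET version in use.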

Lucene search works only with lowercase letters

I am adding my Lucene document like the following:
final Document document = new Document();
document.add(new Field("login", user.getLogin(), Field.Store.YES, Field.Index.NO));
document.add(new Field("email", user.getEmail(), Field.Store.YES, Field.Index.ANALYZED));
document.add(new Field("firstName", user.getFirstName(), Field.Store.YES, Field.Index.ANALYZED));
document.add(new Field("lastName", user.getLastName(), Field.Store.YES, Field.Index.ANALYZED));
userIndexWriter.addDocument(document);
So if I search with lowercase letters the search is successful, but if I search with capital letters the search returns nothing.
Does anybody have a clue what I am missing?
analyzer = new StandardAnalyzer(Version.LUCENE_36);
final IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer);
final IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
and my search manager:
final SearcherManager searcherManager = new SearcherManager(indexWriter, true, null);
and I am searching like the following:
final BooleanQuery booleanQuery = new BooleanQuery();
final Query query1 = new PrefixQuery(new Term("email", prefix));
final Query query2 = new PrefixQuery(new Term("firstName", prefix));
final Query query3 = new PrefixQuery(new Term("lastName", prefix));
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
booleanQuery.add(query3, BooleanClause.Occur.SHOULD);
final SortField sortField = new SortField("firstName", SortField.STRING, true);
final Sort sort = new Sort(sortField);
final TopDocs topDocs = searcherManager.search(booleanQuery, DEFAULT_TOP_N_SEARCH_USER, sort);
Make sure you apply the same analysis to both the document and the query. For instance, if you index with StandardAnalyzer, then you also need to apply it to your query, like this:
QueryParser queryParser = new QueryParser(Version.LUCENE_CURRENT, "firstName", new StandardAnalyzer(Version.LUCENE_CURRENT));
try {
    Query q = queryParser.parse("Ameer");
} catch (ParseException e) {
    e.printStackTrace();
}
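Note that the question's code builds PrefixQuery objects directly, and PrefixQuery terms never pass through an analyzer. So in addition to using QueryParser where possible, any prefix taken from user input should be normalized the same way StandardAnalyzer normalized the indexed tokens, i.e. lowercased:

```java
import java.util.Locale;

public class PrefixCase {
    // PrefixQuery bypasses the analyzer, so lowercase the user's prefix
    // manually to match the lowercased tokens StandardAnalyzer indexed.
    static String normalizePrefix(String prefix) {
        return prefix.toLowerCase(Locale.ROOT);
    }
}
```

Used as, e.g., new PrefixQuery(new Term("firstName", normalizePrefix(prefix))), this makes "Ameer" and "ameer" behave identically.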

lucene updateDocument does not work

I am using Lucene 3.6. I want to know why update does not work. Is there anything wrong?
public class TokenTest
{
    private static String IndexPath = "D:\\update\\index";
    private static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

    public static void main(String[] args) throws Exception
    {
        try
        {
            update();
            display("content", "content");
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    @SuppressWarnings("deprecation")
    public static void display(String keyField, String words) throws Exception
    {
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File(IndexPath)));
        Term term = new Term(keyField, words);
        Query query = new TermQuery(term);
        TopDocs results = searcher.search(query, 100);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits)
        {
            Document doc = searcher.doc(hit.doc);
            System.out.println("doc_id = " + hit.doc);
            System.out.println("content: " + doc.get("content"));
            System.out.println("path: " + doc.get("path"));
        }
    }

    public static String update() throws Exception
    {
        IndexWriterConfig writeConfig = new IndexWriterConfig(Version.LUCENE_33, analyzer);
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File(IndexPath)), writeConfig);
        Document document = new Document();
        Field field_name2 = new Field("path", "update_path", Field.Store.YES, Field.Index.ANALYZED);
        Field field_content2 = new Field("content", "content update", Field.Store.YES, Field.Index.ANALYZED);
        document.add(field_name2);
        document.add(field_content2);
        Term term = new Term("path", "qqqqq");
        writer.updateDocument(term, document);
        writer.optimize();
        writer.close();
        return "update_path";
    }
}
I assume you want to update your document so that field "path" = "qqqq". You have this exactly backwards (please read the documentation).
updateDocument performs two steps:
1. Find and delete any documents containing term. In this case, none are found, because your indexed documents do not contain path:qqqqq.
2. Add the new document to the index.
You appear to be doing the opposite: looking up by document and then trying to add the term to it, and it doesn't work that way. What you are looking for, I believe, is something like:
Term term = new Term("content", "update");
document.removeField("path");
document.add(new Field("path", "qqqq", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(term, document);
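The delete-then-add semantics can be simulated without Lucene: the list of maps below stands in for the index, and the method mirrors updateDocument's contract. Note that when the term matches nothing, the call silently degrades to a plain add, which is exactly why the question's code "does nothing":

```java
import java.util.*;

public class UpdateSemantics {
    static List<Map<String, String>> index = new ArrayList<>();

    // updateDocument(term, doc) == deleteDocuments(term) followed by addDocument(doc).
    // It never modifies an existing document in place.
    static void updateDocument(String field, String value, Map<String, String> doc) {
        index.removeIf(d -> value.equals(d.get(field))); // delete by term (no-op if no match)
        index.add(doc);                                  // then add the replacement document
    }
}
```

So the term must select the old document (here, something it actually contains, like content:update), and the replacement document must carry the new field values.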

storing the RDBMS table data through lucene in text file on hard disk

I want to store the result of an RDBMS SQL query with 3.2 million records on disk using Lucene and then search it.
I saw the example here: how to integrate RAMDirectory into FSDirectory in lucene. I have this piece of code that is working for me:
public class lucetest {
    public static void main(String args[]) {
        lucetest lucetestObj = new lucetest();
        lucetestObj.main1(lucetestObj);
    }

    public void main1(lucetest lucetestObj) {
        final File INDEX_DIR = new File(
                "C:\\Documents and Settings\\44444\\workspace\\lucenbase\\bin\\org\\lucenesample\\index");
        try {
            Class.forName("com.teradata.jdbc.TeraDriver").newInstance();
            Connection conn = DriverManager.getConnection(
                    "jdbc:teradata://x.x.x.x/CHARSET=UTF16", "aaa", "bbb");
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
            // Directory index = new RAMDirectory(); // to use RAM instead
            Directory index = FSDirectory.open(INDEX_DIR); // on-disk index, does not consume RAM
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
            IndexWriter writer = new IndexWriter(index, config);
            System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
            lucetestObj.indexDocs(writer, conn);
            writer.optimize();
            writer.close();
            lucetestObj.searchDocs(index, analyzer, "india");
            try {
                conn.close();
            } catch (SQLException e2) {
                e2.printStackTrace();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    void indexDocs(IndexWriter writer, Connection conn) throws Exception {
        // CFMASTERID is read below, so it must appear in the SELECT list;
        // note the spaces between the concatenated fragments.
        String query = "SELECT CFMASTERID, CFMASTERNAME, ULTIMATEPARENTID, "
                + "ULTIMATEPARENT, LONG_NAMEE FROM XCUST_SRCH_SRCH "
                + "SAMPLE 100000;";
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(query);
        while (rs.next()) {
            Document d = new Document();
            d.add(new Field("id", rs.getString("CFMASTERID"), Field.Store.YES,
                    Field.Index.NO));
            d.add(new Field("name", rs.getString("CFMASTERNAME"),
                    Field.Store.YES, Field.Index.ANALYZED));
            d.add(new Field("color", rs.getString("LONG_NAMEE"),
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(d);
        }
        rs.close();
    }
    void searchDocs(Directory index, StandardAnalyzer analyzer,
            String searchstring) throws Exception {
        String querystr = searchstring.length() > 0 ? searchstring : "lucene";
        Query q = new QueryParser(Version.LUCENE_35, "name", analyzer).parse(querystr);
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". CFMASTERNAME " + d.get("name")
                    + " ****LONG_NAMEE**" + d.get("color") + "****ID******"
                    + d.get("id"));
        }
        searcher.close();
    }
}
How can I change this code so that the SQL result table is saved on the hard disk at the specified path instead of in a RAMDirectory? I am not able to work out a solution. My requirement is that this table data, stored on disk through Lucene, returns results very fast; hence I am saving the data on disk through Lucene, where it is indexed.
Directory index = FSDirectory.open(INDEX_DIR);
You mention saving the SQL result to a text file, but that is unnecessary overhead. As you iterate through the ResultSet, save the rows directly to the Lucene index.
As an aside, not that it matters much, but naming a local variable (final or otherwise) in all caps is against convention. Use camelCase; all caps is reserved for class-level constants (static final members of a class).
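The stream-rows-straight-into-the-index loop can be sketched without the database or Lucene on the classpath. Here the iterator stands in for the JDBC ResultSet and a list of maps stands in for the index; in the real code, each row becomes a Lucene Document handed to writer.addDocument(d), with no intermediate text file:

```java
import java.util.*;

public class StreamingIndexer {
    // Feed each row into the index as it is read, so only one row at a time
    // is held in memory regardless of how many million rows the query returns.
    static int indexRows(Iterator<String[]> rows, List<Map<String, String>> index) {
        int count = 0;
        while (rows.hasNext()) {
            String[] row = rows.next();
            Map<String, String> doc = new HashMap<>();
            doc.put("name", row[0]);  // rs.getString("CFMASTERNAME") in the real code
            doc.put("color", row[1]); // rs.getString("LONG_NAMEE") in the real code
            index.add(doc);           // writer.addDocument(d) in the real code
            count++;
        }
        return count;
    }
}
```

With an FSDirectory-backed writer, this pattern writes the 3.2 million rows to the on-disk index in a single pass.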