Use lucene index in java application - lucene

Recently i stared working on solr. I have created index in solr and i want to query on it through my java application. I don't want to use solr.war in my application. How can i use it through solrj api or lucene java api? My thinking is to add those index in project context and use it. I gone through some examples/tutorials but did not find any on how to work with already created index. Please tell me a proper solution for it or any link specifying the solution will be appreciated.

You can use Lucene apis to create/update and search on an index.
As solr is based on lucene, the underlying index is the lucene index.
Lucene exposes classes as IndexWriter and IndexSearcher, which would help you interact with index.
Example for searching over an solr/lucene index -
Directory index = FSDirectory.open(new File("/path/to/index"));
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Should be able to find examples on this.

Yes, you can use a Solr-created index with Lucene, there's nothing particular about it because Solr itself uses Lucene. So all Lucene documentation applies unchanged.
Or if you don't want to use Solr as a server you can use it embedded in your Java application.

I made it this way..
String realPath = request.getRealPath("/");
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
Directory index = FSDirectory.open(new File(realPath+"/index"));
IndexSearcher indexSearcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(2000, true);
QueryParser query = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query q = null;
try {
q = query.parse("*:*");
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
indexSearcher.search(q, collector);
ScoreDoc[] scoreDoc = collector.topDocs().scoreDocs;

Related

how to use lucene-gosen analyser with lucene.net?

Please guide me how to use japanese analyser (lucene-gosen) with Lucene.net. And also suggest me some good analyzer for Lucene.net that support Japanese.
The Lucene-Gosen analyzer does not appear to be ported to Lucene.Net. You can make a request on their github page or you could help them out by porting it and submitting a pull request.
Once that analyzer exists and using the article here - using their basic code, just change the analyzer:
string strIndexDir = #"D:\Index";
Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
Analyzer std = new JapaneseAnalyzer(Lucene.Net.Util.Version.LUCENE_29); //Version parameter is used for backward compatibility. Stop words can also be passed to avoid indexing certain words
IndexWriter idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
//Create an Index writer object.
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldText = new Lucene.Net.Documents.Field("text", System.IO.File.ReadAllText(#"d:\test.txt"), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldText);
//write the document to the index
idxw.AddDocument(doc);
//optimize and close the writer
idxw.Optimize();
idxw.Close();
Response.Write("Indexing Done");

PhraseQuery+Lucene 4.6 is not working for PDF Word search

Iam Using lucene 4.6 version with Phrase Query for searching the words from PDF. Below is my code. Here Iam able to get the out put text from the PDF also getting the query as contents:"Following are the". But No.of hits is showing as 0. Any suggestions on it?? Thanks in advance.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
iwriter.deleteAll();
iwriter.commit();
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception Occured while Loading the document: " + ex);
}
String output=new PDFTextStripper().getText(document);
System.out.println(output);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
String sentence = "Following are the";
//IndexSearcher searcher = new IndexSearcher(directory);
if(output.contains(sentence)){
System.out.println("");
}
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}
There are two reasons why your PhraseQuery is not working
StandardAnalyzer uses ENGLISH_STOP_WORDS_SET which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with these words which will be removed from TokenStream while indexing. That means when you search "Following are the" in index, are and the will not be found. so you will never get any result for such a PhraseQuery as are and the will never be there in first place to search with.
Solution for this is use this constructor for
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET); while indexing this will make sure that StopFilter will not remove any word from TokenStream while indexing.
StandardAnalyzer also uses LowerCaseFilter that means all tokens will be normalized to lower case. so Following will be indexed as following that means searching "Following" won't give you result. For this .toLowerCase() will come to your rescue, just use this on your sentence and you should get results from search.
Also have a look at this link which specify Unicode Standard Annex #29 which is followed by StandardTokenizer. And from brief look at it, it looks like APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters under certain condition will be ignored while indexing.

Lucene - Specialized TokenStream/Analyzer given a set of indexable keywords

I have the following situation
I have a collection of documents to index. But I need to be selective in what I index.
Selection criteria: the document must contain one of the keywords from a given Set.
That part is easy, I can check if any of those keywords are present in the document and only then index the document.
The tricky situation is (for me anyway!), that I want to index only these keywords. And these keywords can be multiworded, or regex expressions as well, say.
What these keywords are going to be is meaningless to this post, because I can abstract that out - I can generate the list of keywords that need to be indexed.
Is there an existing TokenStream, Analyzer, Filter combination that I can use?
And if there isn't, please could someone point me in the right direction.
If my question isn't clear enough:
HashSet<String> impKeywords = new HashSet<String>(new String[] {"Java", "Lucene"});
I have a class Content which I use, say:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, say I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene"
And if there are matchingKeywords, only then proceed to index the document; so:
if(!matchingKeywords.isEmpty()) {
// prepare document for indexing, and index.
// But what should be my Analyzer and TokenStream?
}
I want to be able to create an Analyzer with a TokenStream that only returns these matching keywords, so only these tokens are indexed.
End notes: One possibility appears to be that for each document I add a variable number of fields with each of the matching keywords. Where these fields are Indexed but not Analyzed using Field.Index.NOT_ANALYZED. However, it would be better if I'm able to figure out a pre-existing Analyzer/TokenStream for this purpose instead of playing around with fields.
Following #femtoRgon's advise I have resolved the said problem as follows.
As explained in the question, I have:
HashSet<String> impKeywords = new HashSet<String>(new String[] {"Java", "Lucene"});
And I have a class Content which I use, say as follows:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene" for this example `content`.
And if there are matchingKeywords, only then proceed to index the document; so while indexing I did:
if(!matchingKeywords.isEmpty()) {
Document doc = new Document();
for(String keyword: matchingKeywords) {
doc.add(new Field("keyword", keyword, Field.Store.YES, Field.Index.NOT_ANALYZED);
}
iwriter.addDocument(doc); // iwriter is the instance of IndexWriter
}
Then, while searching I created the following boolean query:
BooleanQuery boolQuery = new BooleanQuery();
for(String queryKeyword: searchKeywords)) {
boolQuery.add(new TermQuery(new Term("keyword", queryKeyword)), BooleanClause.Occur.SHOULD);
}
ScoreDoc[] hits = isearcher.search(boolQuery, null, 1000).scoreDocs; // isearcher is the instance of IndexSearcher
Hope this answer helps someone with similar needs.

Why I can't get the doc that added recently by IndexWriter in the search result in Lucene 4.0?

Such as the title said, I have encountered a puzzled problem.
I have built an index for my test program, then I use IndexWriter to add a document into index. The code is :
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc1 = new Document();
doc1.add(new Field("name", "张三", Field.Store.YES, Field.Index.ANALYZED));
doc1.add(new IntField("year", 2013, Field.Store.YES));
doc1.add(new TextField("content", "123456789", Field.Store.YES));
iwriter.addDocument(doc1);
iwriter.commit();
iwriter.close();
When I try to search in this index, I can't get this doc. I really get a correct result count, it is one more than before. But when I try to print the doc.get('name'), the output is wrong.
The code in search part is:
DirectoryReader ireader = DirectoryReader.open(directory);
System.out.println(ireader.numDeletedDocs());
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query query = parser.parse("张");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length);
In results, there is a "Name: 李四".
I'm sure that I use the StandardAnalyzer during indexing and searching. And StandardAnalyzer will make one Chinese character as a single token. Why when I search "张", I will get "李四"? Is there anything wrong when I add a doc? Or the docid is mismatch?
Did you (re)open the index after adding the doc? Lucene searches only return the documents that existed as of the time the index was opened for searching.
[edit...]
Use IndexReader.Open() or IndexReader.doOpenIfChanged() to open the index again. doOpenIfChanged() has the advantage that it returns null if you can still use the old IndexReader instance (because the index has not changed).
(If I recall correctly, DirectoryReader.Open() just opens the index directory, so the higher-level Lucene code does not realize that the index has changed if you just call DirectoryReader.Open.)

Katta docId to Document

How to use FieldCache in Katta, FieldCache expects IndexReader as arguments, then how to get IndexReader from Katta API. And In katta the search method in LuceneClient.java returns Hits.
From this I can get List, from that I can able to get each hit's docId, but I need particular field value of the docId in Katta. Please give me some coding example.
I've never worked with Katta, I worked with Solr and if I had to get document by its id and I had to use only Lucene classes, I'd use org.apache.lucene.search.IndexSearcher:
// when you figure out how to get IndexReader using Katta API, you'll be able to get the searcher
IndexSearcher searcher = new IndexSearcher(indexReader);
org.apache.lucene.document.Document doc = searcher.doc(docId);
String yourFieldValue = doc.get("yourFieldName");
you can't use the FieldCache on client side, since the IndexReader is located on the server side!
But you can get field-values through the getDetails() method on LuceneClient.
final Hits hits = client.search(query, new String[] { INDEX_NAME }, 10);
for (final Hit hit : hits.getHits()) {
final MapWritable details = client.getDetails(hit, new String[] { "path" });
details.get(new Text("path"));
HTH
Johannes