Katta docId to Document - lucene

How can I use FieldCache in Katta? FieldCache expects an IndexReader as an argument, so how do I get an IndexReader from the Katta API? Also, in Katta the search method in LuceneClient.java returns Hits.
From this I can get a List of hits, and from each hit its docId, but I need a particular field value for that docId in Katta. Please give me a code example.

I've never worked with Katta, but I have worked with Solr. If I had to get a document by its id using only Lucene classes, I'd use org.apache.lucene.search.IndexSearcher:
// when you figure out how to get IndexReader using Katta API, you'll be able to get the searcher
IndexSearcher searcher = new IndexSearcher(indexReader);
org.apache.lucene.document.Document doc = searcher.doc(docId);
String yourFieldValue = doc.get("yourFieldName");

You can't use the FieldCache on the client side, since the IndexReader is located on the server side!
But you can get field-values through the getDetails() method on LuceneClient.
final Hits hits = client.search(query, new String[] { INDEX_NAME }, 10);
for (final Hit hit : hits.getHits()) {
final MapWritable details = client.getDetails(hit, new String[] { "path" });
final Text path = (Text) details.get(new Text("path"));
}
HTH
Johannes

Related

Index field using: new TextField(String fieldName, Reader reader)

I've been trying to index a field using the readerValue() that Lucene provides on Field. The thing is that the terms do not seem to be indexed. This is the interesting part of the code:
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.deleteAll();
String str = "Some random text to be indexed";
Reader reader = new StringReader(str);
Document doc = new Document();
doc.add(new TextField("content", reader));
indexWriter.addDocument(doc);
Now, if I index that text as a String using the other TextField constructor, it works fine; but like this it does not seem to index the terms, and instead returns null when I try to get the value of the field after a search:
QueryParser queryParser = new QueryParser("content", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query,10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document document = indexSearcher.doc(scoreDoc.doc);
Reader r = document.getField("content").readerValue();
I really can't see the problem. Maybe it is some dumb thing that I missed, or maybe I'm using it wrong? Thanks in advance for any help.
By default, TextField is unstored. The behavior you are seeing is expected for an unstored field: you should be able to search on it, but not retrieve it from the index. The constructor that takes a string argument for the field contents allows you to set whether to store the field or not, hence the different behavior.
The reason the store option is not available on that constructor is that Lucene explicitly disallows a stored field to be set with a Reader or TokenStream value. If you want to store the field, you will simply need to get the string value from the Reader yourself.
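For example, a minimal sketch (my own, not from the question) that drains the Reader into a String first and then adds the field as both indexed and stored:
StringBuilder sb = new StringBuilder();
char[] buffer = new char[1024];
int n;
while ((n = reader.read(buffer)) != -1) {
sb.append(buffer, 0, n);
}
String value = sb.toString();
// indexed for search and stored for retrieval via doc.get("content")
doc.add(new TextField("content", value, Field.Store.YES));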

After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

I am creating a search engine for a large number of HTML documents using lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div>. I am thinking I need to convert these to plain text first and write some code to merge words that were hyphenated, before I send them to lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?
OK, I figured out something I can get started with. :)
To index:
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config);
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
// documents need to be read from the data source..
// only add once, or else your docs will be duplicated as you continue to use the system
writer.close();
Specify offsets to store for highlighting:
private static final FieldType typeOffsets;
static {
typeOffsets = new FieldType(TextField.TYPE_STORED);
typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
The addDoc method:
void addDoc(IndexWriter writer, String title, String body) {
Document doc = new Document();
doc.add(new Field("title", body, typeOffsets));
doc.add(new Field("body", body, typeOffsets));
// you can also add and store a TextField that does not have offsets,
// like a file ID that you wouldn't search on but just need to reference the original doc.
writer.addDocument(doc);
}
Perform your first search:
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(q);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
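For example, a small sketch of that iteration (highlightFields returns a map from field name to one snippet per result document, in topDocs order):
Map<String, String[]> snippets = highlighter.highlightFields(fields, query, searcher, topDocs);
for (Map.Entry<String, String[]> entry : snippets.entrySet()) {
for (String snippet : entry.getValue()) {
// a snippet can be null if the field had no match for that document
System.out.println(entry.getKey() + ": " + snippet);
}
}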
When you want to highlight the end document (i.e. after the search is completed and user selected the result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole thing into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
highlighter.setTextFragmenter(new NullFragmenter());
String text = highlighter.getBestFragment(analyzer, "", pText);
if (text != null)
{
return text;
}
return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator, and then override your BreakIterator so that it thinks the whole document is one sentence.
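For example, a minimal sketch of that override, assuming Lucene 4.3+ where WholeBreakIterator is available:
PostingsHighlighter highlighter = new PostingsHighlighter() {
@Override
protected BreakIterator getBreakIterator(String field) {
// treats the entire field value as a single passage
return new WholeBreakIterator();
}
};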
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.
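A rough sketch of that idea, assuming the Lucene 4.5+ PassageFormatter API where format() returns an Object:
PostingsHighlighter highlighter = new PostingsHighlighter() {
@Override
protected PassageFormatter getFormatter(String field) {
return new PassageFormatter() {
@Override
public Object format(Passage[] passages, String content) {
// collect [start, end) character offsets of every match instead of <b> markup
List<int[]> offsets = new ArrayList<int[]>();
for (Passage passage : passages) {
for (int i = 0; i < passage.getNumMatches(); i++) {
offsets.add(new int[] { passage.getMatchStarts()[i], passage.getMatchEnds()[i] });
}
}
return offsets;
}
};
}
};
// note: getting non-String results back may require the highlightFieldsAsObjects variant, depending on your Lucene version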

Searching for href values with Lucene (Examine in Umbraco)?

I want to search for a href value with Lucene/Examine, more precisely the 'locallink' value. Examine is straight out-of-the-box standard config.
I have the following snippet, which does not return any results:
string searchQuery = "localLink:" + id;
UmbracoHelper helper = new UmbracoHelper(UmbracoContext.Current);
foreach (var result in helper.Search(searchQuery, false))
{
// Do something
}
Upon inspection of the index via Developer > Examine Management (in the Umbraco backend), I can see that the index does contain the value I am trying to search for, but under a "_Raw" property. So I guess the question is: how can I make my search look in these fields as well?
You searched with UmbracoHelper.
Try using the Examine searcher directly, as described in the docs:
var searcher = ExamineManager.Instance.SearchProviderCollection["WebsiteSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.Or);
// build the query from the criteria; the field name is taken from the question
var query = searchCriteria.Field("localLink", id.ToString()).Compile();
var searchResults = searcher.Search(query);
http://our.umbraco.org/documentation/Reference/Searching/Examine/
http://umbraco.com/follow-us/blog-archive/2011/9/16/examining-examine.aspx

PhraseQuery+Lucene 4.6 is not working for PDF Word search

I am using Lucene 4.6 with PhraseQuery to search for words in a PDF. Below is my code. I am able to get the output text from the PDF, and the query prints as contents:"Following are the", but the number of hits is 0. Any suggestions? Thanks in advance.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
try {
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
iwriter.deleteAll();
iwriter.commit();
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception Occured while Loading the document: " + ex);
}
String output = new PDFTextStripper().getText(document);
System.out.println(output);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
String sentence = "Following are the";
//IndexSearcher searcher = new IndexSearcher(directory);
if (output.contains(sentence)) {
// sanity check: the raw extracted text does contain the phrase
System.out.println("Sentence present in extracted text");
}
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}
There are two reasons why your PhraseQuery is not working:
StandardAnalyzer uses ENGLISH_STOP_WORDS_SET, which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These words are removed from the TokenStream while indexing. That means when you search for "Following are the" in the index, are and the will not be found, so you will never get any result for such a PhraseQuery: are and the were never there to be searched in the first place.
The solution is to use this constructor while indexing:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
This makes sure that the StopFilter does not remove any words from the TokenStream while indexing.
StandardAnalyzer also uses LowerCaseFilter, which means all tokens are normalized to lower case: Following is indexed as following, so searching for "Following" won't give you a result. Here .toLowerCase() comes to your rescue; just apply it to your sentence and you should get results from the search.
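Putting both fixes together, a sketch of how the code from the question could change (same Lucene 4.6 API as above):
// keep stop words by indexing with an empty stop set
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
// ... index with this analyzer as before, then lower-case the phrase when querying:
PhraseQuery query = new PhraseQuery();
for (String word : sentence.toLowerCase().split(" ")) {
query.add(new Term("contents", word));
}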
Also have a look at this link, which specifies Unicode Standard Annex #29, the standard followed by StandardTokenizer. From a brief look at it, it looks like APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters will, under certain conditions, be ignored while indexing.

Way of working with Lucene index custom Collector

Can someone help me understand how to work with customized implementations of the abstract Collector class in Lucene?
I've implemented two ways of querying the index with some test text:
1. Total hits is equal to 2. Both file names are the same, hence the results size is 1 because I keep them in a set.
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
LOG.info("Total hits " + topDocs.totalHits);
ScoreDoc[] scoreDosArray = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDosArray) {
Document doc = searcher.doc(scoreDoc.doc);
String fileName = doc.get(FILENAME_FIELD);
results.add(fileName);
}
2. countCollect is equal to 2. Both documents from which I get file names in the collect method of the Collector are unique, hence the final results size is also 2. The countNextReader variable is equal to 10 at the end of the run.
private Set<String> doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
final Set<String> results = new HashSet<String>();
Collector collector = new Collector() {
private int base;
private Scorer scorer;
private int countCollect;
private int countNextReader;
@Override
public void collect(int doc) throws IOException {
Document document = searcher.doc(doc);
String filename = document.get(FILENAME_FIELD);
results.add(filename);
countCollect++;
}
@Override
public boolean acceptsDocsOutOfOrder() {
return true;
}
@Override
public void setScorer(Scorer scorer) throws IOException {
this.scorer = scorer;
}
@Override
public void setNextReader(AtomicReaderContext ctx) throws IOException {
this.base = ctx.docBase;
countNextReader++;
}
@Override
public String toString() {
LOG.info("CountCollect: " + countCollect);
LOG.info("CountNextReader: " + countNextReader);
return null;
}
};
searcher.search(query, collector);
collector.toString();
return results;
}
I don't understand why, within the collect method, I get different documents and different file names compared with the previous implementation. I would expect the same results, wouldn't I?
The Collector#collect method is the hotspot of a search request. It's called for every document that matches the query, not only the ones that you get back. In fact, you usually get back only the top documents, which are effectively the ones that you show to the users.
I would suggest not to do things like:
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
which would force Lucene to collect and return far too many documents.
Anyway, if you only have two matching documents (or you are asking for all the documents that match), the number of documents that you get back and the number of calls to the collect method should be the same.
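As an aside, if all you need is the number of matches, a TotalHitCountCollector (available in Lucene 4.x) counts them without ranking or fetching any documents:
TotalHitCountCollector counter = new TotalHitCountCollector();
searcher.search(query, counter);
int totalHits = counter.getTotalHits();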
The setNextReader method is something completely different that you shouldn't care that much about. Have a look at this article if you want to know more about AtomicReader and so on. To keep it short, Lucene stores data as segments, which are mini searchable inverted indexes. Every query is executed on each segment sequentially. Every time the search switches to the next segment, the setNextReader method is called, allowing you to do operations at the segment level within the Collector. For example, the internal Lucene document id is unique only within its segment, thus you need to add docBase to it to make it unique within the whole index. That's why you need to store docBase when the segment changes and take it into account. Your countNextReader variable just contains the number of segments that have been analyzed for your query; it doesn't have anything to do with your documents.
Looking deeper at your Collector code I also noticed you are not taking into account the docBase when retrieving documents by id. This should fix it:
Document document = searcher.doc(doc + docBase);
Also keep in mind that loading a stored field within a Collector is not really a wise thing to do: it will make your searches really slow, because stored fields are loaded from disk. You usually load stored fields only for the subset of documents that you want to return. Within a Collector you usually load only the information needed to score documents, like payloads or similar things, often making use of the Lucene field cache too.
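To illustrate, here is a sketch (my own, against the Lucene 4.x API used in the question) of the same Collector reworked along those lines: it records segment-corrected doc ids during collection and loads the stored field only after the search.
final List<Integer> ids = new ArrayList<Integer>();
Collector collector = new Collector() {
private int base;
@Override
public void collect(int doc) {
// record only the corrected id; no stored-field access here
ids.add(doc + base);
}
@Override
public boolean acceptsDocsOutOfOrder() {
return true;
}
@Override
public void setScorer(Scorer scorer) {
// unused: no scoring needed for this collector
}
@Override
public void setNextReader(AtomicReaderContext ctx) {
base = ctx.docBase;
}
};
searcher.search(query, collector);
// load stored fields only after the search has finished
for (int id : ids) {
results.add(searcher.doc(id).get(FILENAME_FIELD));
}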