I started working my way through the second edition of 'Lucene in Action', which uses the 3.0 API. The author creates a basic IndexWriter with the following method:
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
return new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
}
In the code below I've made the changes according to the current API, with the exception that I cannot figure out how to set the writer's max field length to unlimited like the constant in the book example. I've just inserted the int 1000 below. Is this unlimited constant just gone completely in the current API?
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_36), 1000));
return new IndexWriter(directory, iwc);
}
Thanks, this is just out of curiosity.
IndexWriter javadoc says:
@deprecated use LimitTokenCountAnalyzer instead. Note that the
behavior slightly changed - the analyzer limits the number of
tokens per token stream created, while this setting limits the
total number of tokens to index. This only matters if you index
many multi-valued fields though.
So, in other words, a hard-wired method has been replaced with a nice adapter/delegate pattern.
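If you want to reproduce the old UNLIMITED behavior, one option (a minimal sketch, assuming Lucene 3.6 and the same directory field as in your question) is to pass Integer.MAX_VALUE as the token limit; as far as I know, leaving the wrapper out entirely also means no limit, since IndexWriterConfig has no field-length setting of its own:
private IndexWriter getIndexWriter() throws IOException {
    // Integer.MAX_VALUE effectively removes the per-field token limit
    Analyzer analyzer = new LimitTokenCountAnalyzer(
            new WhitespaceAnalyzer(Version.LUCENE_36), Integer.MAX_VALUE);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    return new IndexWriter(directory, iwc);
}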
Please guide me on how to use the Japanese analyzer (lucene-gosen) with Lucene.Net. Could you also suggest a good analyzer for Lucene.Net that supports Japanese?
The lucene-gosen analyzer does not appear to have been ported to Lucene.Net. You can make a request on their GitHub page, or you could help them out by porting it and submitting a pull request.
Once that analyzer exists, you can follow the article here; using their basic code, just change the analyzer:
string strIndexDir = @"D:\Index";
Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
Analyzer std = new JapaneseAnalyzer(Lucene.Net.Util.Version.LUCENE_29); //Version parameter is used for backward compatibility. Stop words can also be passed to avoid indexing certain words
//Create an IndexWriter object
IndexWriter idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldText = new Lucene.Net.Documents.Field("text", System.IO.File.ReadAllText(@"d:\test.txt"), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldText);
//write the document to the index
idxw.AddDocument(doc);
//optimize and close the writer
idxw.Optimize();
idxw.Close();
Response.Write("Indexing Done");
Can someone help me understand how to work with customized implementations of the abstract Collector class in Lucene?
I've implemented two ways of querying the index with some test text:
1. Total hits is equal to 2. Both file names are the same, hence the results size is equal to 1 because I keep them in a set.
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
LOG.info("Total hits " + topDocs.totalHits);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
String fileName = doc.get(FILENAME_FIELD);
results.add(fileName);
}
2. countCollect is equal to 2. Both documents from which I get file names in the collect method of the Collector are unique, hence the final results size is also equal to 2. The countNextReader variable at the end of the logic is equal to 10.
private Set<String> doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
final Set<String> results = new HashSet<String>();
Collector collector = new Collector() {
private int base;
private Scorer scorer;
private int countCollect;
private int countNextReader;
@Override
public void collect(int doc) throws IOException {
Document document = searcher.doc(doc);
String filename = document.get(FILENAME_FIELD);
results.add(filename);
countCollect++;
}
@Override
public boolean acceptsDocsOutOfOrder() {
return true;
}
@Override
public void setScorer(Scorer scorer) throws IOException {
this.scorer = scorer;
}
@Override
public void setNextReader(AtomicReaderContext ctx) throws IOException {
this.base = ctx.docBase;
countNextReader++;
}
@Override
public String toString() {
LOG.info("CountCollect: " + countCollect);
LOG.info("CountNextReader: " + countNextReader);
return null;
}
};
searcher.search(query, collector);
collector.toString();
return results;
}
I don't understand why, within the collect method, I get different documents and different file names compared with the previous implementation. I would expect the same result, wouldn't I?
The Collector#collect method is the hotspot of a search request. It's called for every document that matches the query, not only the ones that you get back. In fact, you usually get back only the top documents, which are effectively the ones that you show to the users.
I would suggest not to do things like:
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
which would force Lucene to return far too many documents.
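For instance, a minimal sketch (the 10 here is an arbitrary page size, not something from your code) that only asks for the top hits would be:
TopDocs topDocs = searcher.search(query, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    // load stored fields only for the few documents you are going to show
    Document doc = searcher.doc(scoreDoc.doc);
}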
Anyway, if you only have two matching documents (or you are asking for all the documents that match), the number of documents that you get back and the number of calls to the collect method should be the same.
The setNextReader method is something completely different that you shouldn't care that much about. Have a look at this article if you want to know more about AtomicReader and so on. To keep it short, Lucene stores data as segments, which are mini searchable inverted indexes. Every query is executed on each segment sequentially. Every time the search switches to the next segment, the setNextReader method is called so that you can perform segment-level operations in the Collector. For example, the internal Lucene document id is unique only within the segment, so you need to add docBase to it to make it unique within the whole index. That's why you need to store it when the segment changes and take it into account. Your countNextReader variable just contains the number of segments that have been analyzed for your query; it doesn't have anything to do with your documents.
Looking deeper at your Collector code I also noticed you are not taking into account the docBase when retrieving documents by id. This should fix it:
Document document = searcher.doc(doc + docBase);
Keep also in mind that loading a stored field within a Collector is not really a wise thing to do. It will make your searches really slow, because stored fields are loaded from disk. You usually load stored fields only for the subset of documents that you want to return. Within a Collector you usually load only the information needed to score documents, like payloads or similar things, often making use of the Lucene field cache too.
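Putting that together, a minimal sketch of your Collector (same Lucene 4.x API as above, FILENAME_FIELD taken from your code) that only records global doc ids during collection and loads the stored field afterwards could look like this:
final List<Integer> docIds = new ArrayList<Integer>();
Collector collector = new Collector() {
    private int docBase;

    @Override
    public void collect(int doc) {
        docIds.add(docBase + doc); // make the segment-local id global
    }

    @Override
    public void setNextReader(AtomicReaderContext ctx) {
        this.docBase = ctx.docBase; // remember the current segment's offset
    }

    @Override
    public void setScorer(Scorer scorer) {
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
};
searcher.search(query, collector);

// load stored fields only once the matching ids are known
Set<String> fileNames = new HashSet<String>();
for (int id : docIds) {
    fileNames.add(searcher.doc(id).get(FILENAME_FIELD));
}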
I have an HttpModule that logs every visit to the site into a Lucene index.
The site is hosted on GoDaddy, and even though I have almost nothing on the page I run my tests on (about 3 KB including CSS), it works slowly.
If I refresh a few times, after the second or third refresh I get a "Lock obtain timed out: SimpleFSLock" error.
My question is: am I doing something wrong, or is this normal behavior?
Is there any way to overcome this problem?
My code:
//state the file location of the index
string indexFileLocation = System.IO.Path.Combine(HttpContext.Current.ApplicationInstance.Server.MapPath("~/App_Data"), "Analytics");
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
//create the index writer with the directory and analyzer defined.
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, false);
//create a document, add in a single field
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
doc.Add(new Lucene.Net.Documents.Field("TimeStamp", DateTime.Now.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED, Lucene.Net.Documents.Field.TermVector.NO));
doc.Add(new Lucene.Net.Documents.Field("IP", request.UserHostAddress.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED, Lucene.Net.Documents.Field.TermVector.NO));
//write the document to the index
indexWriter.AddDocument(doc);
//optimize and close the writer
//indexWriter.Optimize();
indexWriter.Close();
How do I use FieldCache in Katta? FieldCache expects an IndexReader as an argument, so how do I get an IndexReader from the Katta API? Also, in Katta the search method in LuceneClient.java returns Hits.
From this I can get a List of hits, and from that I can get each hit's docId, but I need a particular field value for that docId in Katta. Please give me a coding example.
I've never worked with Katta; I've worked with Solr. If I had to get a document by its id using only Lucene classes, I'd use org.apache.lucene.search.IndexSearcher:
// when you figure out how to get IndexReader using Katta API, you'll be able to get the searcher
IndexSearcher searcher = new IndexSearcher(indexReader);
org.apache.lucene.document.Document doc = searcher.doc(docId);
String yourFieldValue = doc.get("yourFieldName");
You can't use the FieldCache on the client side, since the IndexReader is located on the server side!
But you can get field values through the getDetails() method on LuceneClient.
final Hits hits = client.search(query, new String[] { INDEX_NAME }, 10);
for (final Hit hit : hits.getHits()) {
final MapWritable details = client.getDetails(hit, new String[] { "path" });
details.get(new Text("path"));
}
HTH
Johannes
When should I use Lucene's RAMDirectory? What are its advantages over other storage mechanisms? Finally, where can I find a simple code example?
When you don't want to permanently store your index data. I use it for testing purposes: add data to your RAMDirectory and run your unit tests against it.
e.g.
public static void main(String[] args) {
    try {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        writer.close(); // add documents before closing
    } catch (IOException e) {
        e.printStackTrace();
    }
}
OR
public void testRAMDirectory () throws IOException {
Directory dir = FSDirectory.getDirectory(indexDir);
MockRAMDirectory ramDir = new MockRAMDirectory(dir);
// close the underlying directory
dir.close();
// Check size
assertEquals(ramDir.sizeInBytes(), ramDir.getRecomputedSizeInBytes());
// open reader to test document count
IndexReader reader = IndexReader.open(ramDir);
assertEquals(docsToAdd, reader.numDocs());
// open a searcher to check if all docs are there
IndexSearcher searcher = new IndexSearcher(reader);
// search for all documents
for (int i = 0; i < docsToAdd; i++) {
Document doc = searcher.doc(i);
assertTrue(doc.getField("content") != null);
}
// cleanup
reader.close();
searcher.close();
}
Usually, if things work with RAMDirectory, they will pretty much work with the other directory implementations too, i.e. when you permanently store your index.
The alternative to this is FSDirectory. You will have to take care of filesystem permissions in that case (which is not an issue with RAMDirectory).
Functionally, there is no distinct advantage of RAMDirectory over FSDirectory (other than the fact that RAMDirectory will be visibly faster). They serve two different needs.
RAMDirectory -> Primary memory
FSDirectory -> Secondary memory
Pretty similar to RAM and a hard disk.
I am not sure what will happen to RAMDirectory if it exceeds the memory limit. I'd expect an OutOfMemoryException (System.SystemException) to be thrown.
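If it helps, here is a small sketch (Lucene 3.x style, matching the API used in the examples above; the "indexDir" path is just a placeholder) that combines the two: an existing on-disk index is copied into a RAMDirectory so that searches run against primary memory:
Directory fsDir = FSDirectory.open(new File("indexDir"));
Directory ramDir = new RAMDirectory(fsDir); // copies the whole index into memory
fsDir.close();

IndexReader reader = IndexReader.open(ramDir);
IndexSearcher searcher = new IndexSearcher(reader);
// ... run queries against the in-memory copy ...
searcher.close();
reader.close();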