When should I use Lucene's RAMDirectory? What are its advantages over other storage mechanisms? Finally, where can I find a simple code example?
When you don't want to permanently store your index data. I use this for testing purposes: add data to your RAMDirectory and run your unit tests against it.
e.g.
public static void main(String[] args) {
    try {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        // ... add documents here, then close the writer
        writer.close();
    } catch (IOException e) { e.printStackTrace(); }
}
OR
public void testRAMDirectory() throws IOException {
    Directory dir = FSDirectory.getDirectory(indexDir);
    MockRAMDirectory ramDir = new MockRAMDirectory(dir);
    // close the underlying directory
    dir.close();
    // check size
    assertEquals(ramDir.sizeInBytes(), ramDir.getRecomputedSizeInBytes());
    // open a reader to test the document count
    IndexReader reader = IndexReader.open(ramDir);
    assertEquals(docsToAdd, reader.numDocs());
    // open a searcher to check whether all docs are there
    IndexSearcher searcher = new IndexSearcher(reader);
    // search for all documents
    for (int i = 0; i < docsToAdd; i++) {
        Document doc = searcher.doc(i);
        assertTrue(doc.getField("content") != null);
    }
    // cleanup
    reader.close();
    searcher.close();
}
Usually, if things work with RAMDirectory, they will work fine with the other Directory implementations too, i.e. when you want to permanently store your index.
The alternative to this is FSDirectory. You will have to take care of filesystem permissions in that case (which does not apply to RAMDirectory).
Functionally, there is no distinct advantage of RAMDirectory over FSDirectory (other than the fact that RAMDirectory will be visibly faster than FSDirectory). They serve two different needs.
RAMDirectory -> primary memory
FSDirectory -> secondary memory
Pretty similar to RAM & hard disk.
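For illustration, a minimal sketch (assuming the Lucene 3.x-era API used above and an example path of /tmp/index) - the indexing code is identical, only the Directory you pass to the IndexWriter differs:
// The same IndexWriter code works with either Directory; only the storage location differs.
Directory inMemory = new RAMDirectory();                        // primary memory, lost when the JVM exits
Directory onDisk   = FSDirectory.open(new File("/tmp/index"));  // secondary memory, persisted on disk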
I am not sure what will happen to RAMDirectory if it exceeds the memory limit. I'd expect an OutOfMemoryException (System.SystemException) to be thrown.
Related
How would one write a Lucene 8.11 ByteBuffersDirectory to disk?
Something similar to Lucene 2.9.4's Directory.copy(directory, FSDirectory.open(indexPath), true).
You can use the copyFrom method to do this.
For example:
Say you are using a ByteBuffersDirectory:
final Directory dir = new ByteBuffersDirectory();
Assuming you are not concurrently writing any new data to that dir, you can declare a target where you want to write the data - for example, an FSDirectory (a file system directory):
Directory to = FSDirectory.open(Paths.get(OUT_DIR_PATH));
Use whatever string you want for the OUT_DIR_PATH location.
Then you can iterate over all the files in the original dir object, writing them to this new to location:
IOContext ctx = new IOContext();
for (String file : dir.listAll()) {
    System.out.println(file); // just for testing
    to.copyFrom(dir, file, file, ctx);
}
This will create the new OUT_DIR_PATH dir and populate it with files, such as:
_0.cfe
_0.cfs
_0.si
segments_1
... or whatever files you happen to have in your dir.
Caveat:
I have only used this with a default IOContext object. There are other constructors for the context - not sure what they do. I assume they give you more control over how the write is performed.
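Note that Lucene also ships a predefined default context, IOContext.DEFAULT, which can be used instead of constructing one yourself; the same loop would then look like this (same dir and to objects as above):
// Same copy loop, using the predefined default context instead of new IOContext()
for (String file : dir.listAll()) {
    to.copyFrom(dir, file, file, IOContext.DEFAULT);
}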
Meanwhile I figured it out myself and created a straightforward method for it:
@SneakyThrows
public static void copyIndex(ByteBuffersDirectory ramDirectory, Path destination) {
    FSDirectory fsDirectory = FSDirectory.open(destination);
    Arrays.stream(ramDirectory.listAll())
          .forEach(fileName -> {
              try {
                  // IOContext is null because it is in fact not used (at least for the moment)
                  fsDirectory.copyFrom(ramDirectory, fileName, fileName, null);
              } catch (IOException e) {
                  log.error(e.getMessage(), e);
              }
          });
}
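A hypothetical usage of the copyIndex method above (the target path is just an example):
// Hypothetical usage: build an index in memory, then copy it to an example on-disk location
ByteBuffersDirectory ramDir = new ByteBuffersDirectory();
// ... index documents into ramDir ...
copyIndex(ramDir, Paths.get("/tmp/lucene-index"));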
The PDFBox API works fine for a small number of files, but I need to merge 10000 PDF files into one, and when I pass 10000 files (about 5 GB) it takes 5 GB of RAM and finally goes out of memory.
Is there some implementation for such a requirement in PDFBox?
I tried to tune it: I used AutoClosedInputStream, which gets closed automatically after reading, but the output is still the same.
I have a similar scenario here, but I need to merge only 1000 documents into a single one.
I tried to use the PDFMergerUtility class, but I was getting an OutOfMemoryError. So I refactored my code to read each document, load the first page (my source documents have one page only), and then merge, instead of using PDFMergerUtility. And now it works fine, with no more OutOfMemoryError.
public void merge(final List<Path> sources, final Path target) {
    final int firstPage = 0;
    try (PDDocument doc = new PDDocument()) {
        for (final Path source : sources) {
            // setupTempFileOnly() refers to PDFBox's MemoryUsageSetting.setupTempFileOnly() (static import assumed)
            try (final PDDocument sdoc = PDDocument.load(source.toFile(), setupTempFileOnly())) {
                final PDPage spage = sdoc.getPage(firstPage);
                doc.importPage(spage);
            }
        }
        doc.save(target.toAbsolutePath().toString());
    } catch (final IOException e) {
        throw new IllegalStateException(e);
    }
}
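A hypothetical usage of the merge method above (directory and file names are just examples):
// Hypothetical usage: merge every PDF found in an example input directory into one file
try (Stream<Path> files = Files.list(Paths.get("input-pdfs"))) {
    List<Path> sources = files.sorted().collect(Collectors.toList());
    merge(sources, Paths.get("merged.pdf"));
}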
How do I remove stop words in Lucene for the given String "This is the chemical orientation"?
I think that Lucene's StopFilter is what you are looking for.
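A minimal sketch of how that could look, assuming a Lucene 8.x-style API and the English stop word set shipped with EnglishAnalyzer:
// Sketch: tokenize the sample sentence, lowercase it, and drop English stop words
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokenizer, stream);
    }
};

try (TokenStream ts = analyzer.tokenStream("content", "This is the chemical orientation")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // prints "chemical" and "orientation"
    }
    ts.end();
}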
You should use StandardAnalyzer, which knows about certain token types, lowercases, removes stop words, and so on.
Example of creating an IndexWriter with StandardAnalyzer:
public IndexWriter Indexer(String dir) throws IOException {
    IndexWriter writer;
    Directory indexDir = FSDirectory.open(new File(dir).toPath());
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
    cfg.setOpenMode(OpenMode.CREATE);
    writer = new IndexWriter(indexDir, cfg);
    return writer;
}
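A hypothetical usage of the Indexer method above (the path and field name are just examples):
// Hypothetical usage: create the writer and index a single document
IndexWriter writer = Indexer("/tmp/lucene-index");
Document doc = new Document();
doc.add(new TextField("content", "This is the chemical orientation", Field.Store.YES));
writer.addDocument(doc);
writer.close();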
I have an HttpModule that logs every visit to the site into a Lucene index.
The site is hosted on GoDaddy, and even though I have almost nothing on the page I run the tests on (about 3 KB including CSS), it works slowly.
If I refresh a few times, after the second or third refresh I get a Lock obtain timed out: SimpleFSLock error.
My question is: am I doing something wrong, or is this normal behavior?
Is there any way to overcome this problem?
My code:
//state the file location of the index
string indexFileLocation = System.IO.Path.Combine(HttpContext.Current.ApplicationInstance.Server.MapPath("~/App_Data"), "Analytics");
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
//create the index writer with the directory and analyzer defined.
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, false);
//create a document, add in a single field
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
doc.Add(new Lucene.Net.Documents.Field("TimeStamp", DateTime.Now.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED, Lucene.Net.Documents.Field.TermVector.NO));
doc.Add(new Lucene.Net.Documents.Field("IP", request.UserHostAddress.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED, Lucene.Net.Documents.Field.TermVector.NO));
//write the document to the index
indexWriter.AddDocument(doc);
//optimize and close the writer
//indexWriter.Optimize();
indexWriter.Close();
I started working my way through the second edition of 'Lucene in Action', which uses the 3.0 API. The author creates a basic IndexWriter with the following method:
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    return new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
}
In the code below I've made the changes according to the current API, with the exception that I cannot figure out how to set the writer's max field length to unlimited like the constant in the book example. I've just inserted the int 1000 below. Is this unlimited constant just gone completely in the current API?
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
            new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_36), 1000));
    return new IndexWriter(directory, iwc);
}
Thanks, this is just out of curiosity.
The IndexWriter javadoc says:
@deprecated use LimitTokenCountAnalyzer instead. Note that the behavior slightly changed - the analyzer limits the number of tokens per token stream created, while this setting limits the total number of tokens to index. This only matters if you index many multi-valued fields though.
So, in other words, a hard-wired method has been replaced with a nice adapter/delegate pattern.
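If you want behavior close to the old UNLIMITED constant, a minimal sketch (assuming the same Lucene 3.6 setup as in the question) is to pass a very large token limit, or simply not to wrap the analyzer at all:
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    // approximate MaxFieldLength.UNLIMITED by using Integer.MAX_VALUE as the token limit
    Analyzer analyzer = new LimitTokenCountAnalyzer(
            new WhitespaceAnalyzer(Version.LUCENE_36), Integer.MAX_VALUE);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    return new IndexWriter(directory, iwc);
}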