Lucene index files changing constantly even when no add, update, or delete operations are performed on the index

I have noticed that my Lucene index segment files (file names) are constantly changing, even when I am not performing any add, update, or delete operations. The only operations I am performing are reading and searching. So, my question is: do Lucene index segment files get updated internally somehow just from reading and searching operations?
I am using Lucene.Net v4.8 beta, if that matters. Thanks!
Here is an example of how I found this issue (I wanted to get the index size). Assuming a Lucene Index already exists, I used the following code to get the index size:
Example:
private long GetIndexSize()
{
    var reader = GetDirectoryReader("validPath");
    long size = 0;
    foreach (var fileName in reader.Directory.ListAll())
    {
        size += reader.Directory.FileLength(fileName);
    }
    return size;
}
private DirectoryReader GetDirectoryReader(string path)
{
    var directory = FSDirectory.Open(path);
    var reader = DirectoryReader.Open(directory);
    return reader;
}
The above method is called every 5 minutes. It works fine ~98% of the time. However, the other 2% of the time, I get a "file not found" error in the foreach loop, and after debugging, I saw that the files in reader.Directory are changing in count. The index is updated at certain times by another service, but I can confirm that no updates were made to the index anywhere near the times when this error occurs.

Since you have multiple processes writing/reading the same set of files, it is difficult to isolate what is happening. Lucene.NET does locking and exception handling to ensure operations can be synced up between processes, but if you read the files in the directory directly without doing any locking, you need to be prepared to deal with IOExceptions.
The solution depends on how up to date you need the index size to be:
If it is okay to be a bit out of date, I would suggest using DirectoryInfo.EnumerateFiles on the directory itself. This may be a bit more up to date than Directory.ListAll() because that method stores the file names in an array, which may go stale before the loop is done. But, you still need to catch FileNotFoundException and ignore it and possibly deal with other IOExceptions.
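As a rough illustration of that first approach, here is a minimal sketch (not from the original answer; the method name and error handling are illustrative) that enumerates the directory on disk and tolerates files disappearing mid-loop because of segment merges:
private long GetApproximateIndexSize(string path)
{
    long size = 0;
    // Lazily enumerate whatever is on disk right now (requires System.IO).
    foreach (var file in new DirectoryInfo(path).EnumerateFiles())
    {
        try
        {
            // Refresh so we read the current length rather than a value
            // cached at enumeration time; the file may have been merged away.
            file.Refresh();
            size += file.Length;
        }
        catch (FileNotFoundException)
        {
            // The file was deleted (e.g. by a segment merge) in the meantime; skip it.
        }
        catch (IOException)
        {
            // Other transient IO failures can be skipped or logged as needed.
        }
    }
    return size;
}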
If you need the size to be absolutely up to date and plan to do an operation that requires the index to be that size, you need to open a write lock to prevent the files from changing while you get the value.
private long GetIndexSize()
{
    // DirectoryReader is superfluous for this example. Also,
    // using a MMapDirectory (which DirectoryReader.Open() may return)
    // will use more RAM than simply using SimpleFSDirectory.
    var directory = new SimpleFSDirectory("validPath");
    long size = 0;

    // NOTE: The lock will stay active until this is disposed,
    // so if you have any follow-on actions to perform, the lock
    // should be obtained before calling this method and disposed
    // after you have completed all of your operations.
    using Lock writeLock = directory.MakeLock(IndexWriter.WRITE_LOCK_NAME);

    // Obtain exclusive write access to the directory
    if (!writeLock.Obtain(/* optional timeout */))
    {
        // timeout failed, either throw an exception or retry...
    }

    foreach (var fileName in directory.ListAll())
    {
        size += directory.FileLength(fileName);
    }
    return size;
}
Of course, if you go that route, your IndexWriter may throw a LockObtainFailedException, and you should be prepared to handle it during the write process.
However you deal with it, you need to be catching and handling exceptions because IO by its nature has many things that can go wrong. But exactly how you deal with it depends on what your priorities are.
Original Answer
If you have an IndexWriter instance open, Lucene.NET will run a background process to merge segments based on the MergePolicy being used. The default settings can be used with most applications.
However, the settings are configurable through the IndexWriterConfig.MergePolicy property. By default, it uses the TieredMergePolicy.
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy()
};
There are several properties on TieredMergePolicy that can be used to change the thresholds that it uses to merge.
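For illustration only, here is a hedged sketch of tuning some of those thresholds. The property names assume Lucene.NET's usual convention of exposing the Java getters/setters as properties (MaxMergedSegmentMB, SegmentsPerTier, FloorSegmentMB), and the values are arbitrary examples, so verify both against the version you are using:
var tieredPolicy = new TieredMergePolicy
{
    // Cap the size (in MB) of segments produced by normal merging.
    MaxMergedSegmentMB = 1024,
    // How many segments are allowed per tier before a merge is triggered.
    SegmentsPerTier = 10,
    // Segments smaller than this (in MB) are treated as equal in size
    // when selecting merges.
    FloorSegmentMB = 2
};
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = tieredPolicy
};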
Or, it can be changed to a different MergePolicy implementation. Lucene.NET comes with:
LogByteSizeMergePolicy
LogDocMergePolicy
NoMergePolicy
TieredMergePolicy
UpgradeIndexMergePolicy
SortingMergePolicy
The NoMergePolicy class can be used to disable merging entirely.
If your application never needs to add documents to the index (for example, if the index is built as part of the application deployment), it is also possible to use an IndexReader opened from a Directory instance directly, which does not do any background segment merges.
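For example, a minimal read-only sketch (the path, field, and term are illustrative) that opens a reader and searches without ever creating an IndexWriter, so no merges are scheduled:
using (var directory = FSDirectory.Open("validPath"))
using (var reader = DirectoryReader.Open(directory))
{
    var searcher = new IndexSearcher(reader);
    // No IndexWriter is open, so no background merges run while searching.
    var hits = searcher.Search(new TermQuery(new Term("field", "value")), 10);
}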
The merge scheduler can also be swapped and/or configured using the IndexWriterConfig.MergeScheduler property. By default, it uses the ConcurrentMergeScheduler.
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy(),
    MergeScheduler = new ConcurrentMergeScheduler()
};
The merge schedulers that are included with Lucene.NET 4.8.0 are:
ConcurrentMergeScheduler
NoMergeScheduler
SerialMergeScheduler
The NoMergeScheduler class can be used to disable merging entirely. This has the same effect as using NoMergePolicy, but also prevents any scheduling code from being executed.
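As an illustration only, a hedged sketch of disabling merging completely. It assumes the singleton members are exposed as in the Java API (NoMergePolicy.COMPOUND_FILES and NoMergeScheduler.INSTANCE), so check those member names against your Lucene.NET version:
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    // Never merge segments...
    MergePolicy = NoMergePolicy.COMPOUND_FILES,
    // ...and never run any merge-scheduling code either.
    MergeScheduler = NoMergeScheduler.INSTANCE
};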

Related

What could be the reasons that .net-core does not release large objects (under linux) right away?

I have the following code snippet, which downloads a roughly 100 MB zip file that is then deserialized into a Foo for further processing. This code is executed once per day.
Stream stream = await downloader.DownloadStream().ConfigureAwait(false);
if (stream.CanSeek && stream.Length == 0)
{
    throw new IOException("Error! Received stream is empty.");
}
ZipArchive archive;
try
{
    archive = new ZipArchive(stream);
}
catch (InvalidDataException)
{
    throw new IOException("Error! Received stream is no zip file.");
}
using (var reader = new StreamReader(archive.Entries[0].Open()))
{
    Foo foo = JsonConvert.DeserializeObject<Foo>(reader.ReadToEnd());
    [...]
}
archive.Dispose();
stream.Dispose();
Now, when that code kicks in, the memory usage jumps as high as about 2500 MB. So far so good and, I would say, pretty common.
The (maybe?) uncommon thing is that the memory sometimes takes days to be released. When running that code on multiple instances, let's say inside k8s, you would see the following pattern of memory usage.
So you see, the memory is eventually released, so I doubt I have an error in the code causing an actual leak. The question is whether this is expected behavior of the CLR, or whether I can somehow improve this behavior. What I have read so far is that this definitely involves the large object heap, but it's hard to find internals on how that heap behaves under Linux.
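As a hedged aside on that large-object-heap point (not part of the original question): the LOH is not compacted by default, but a one-off compaction can be requested explicitly; whether freed pages are actually returned to the OS depends on the GC mode (server vs. workstation) and runtime version.
// Requires: using System; using System.Runtime;
// Ask the GC to compact the large object heap on the next blocking full GC.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();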
So for hosting this in k8s, this means I need to plan resources for the worst-case scenario, in which all hosted instances have about 2500 MB allocated.

How to use a for loop instead of a foreach loop when traversing GetProcessesByName

I have been searching the internet for how to get all the processes of an application. So far, every implementation I have found traverses the result with a foreach loop, which I'm not familiar with. It works, but I just can't rest easy with it working without understanding it. So I'd like to ask if someone can show me how to implement such code using a for loop.
System::Diagnostics::Process^ current = System::Diagnostics::Process::GetCurrentProcess();
for each (System::Diagnostics::Process^ process in System::Diagnostics::Process::GetProcessesByName(current->ProcessName))
    if (process->Id != current->Id)
    {
        // process already exist
    }
I'm using visual studio c++/clr btw, hence :: since it's not in c#.
GetProcessesByName returns a cli::array, so iterate using that and its Length property.
cli::array<Process^>^ processes = Process::GetProcessesByName(current->ProcessName);
for (int i = 0; i < processes->Length; i++)
{
    if (processes[i]->Id != current->Id)
    {
        // process already exist
    }
}
That said, there might be a better way to do this.
It looks like you're trying to see if there's another copy of your application running, so that you can display an "App is already running, switch to that one instead" message to the user.
The problem is that the process name here is just the filename of your EXE, without the path or the "EXE" extension. Try renaming your application to "Notepad.exe", and run a couple copies of the Windows Notepad, and you'll see that they both show up in the GetProcessesByName result.
A better way to do this is to create a named Mutex, and check for its existence and/or lock status at startup.
Here's an answer that does just that: Prevent multiple instances of a given app in .NET? It is written in C#, but it can be translated to C++/CLI. The only notable thing about the translation is that it uses a C# using statement; in C++/CLI that would be the line Mutex mutex(false, "whatever string");, taking advantage of C++/CLI's stack semantics.
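For reference, a minimal C# sketch of that named-mutex pattern (the mutex name is illustrative and should be unique to your application); the same logic can be expressed in C++/CLI with stack semantics as noted above:
// Requires: using System.Threading;
static void Main()
{
    bool createdNew;
    // true = try to take initial ownership of the named mutex.
    using (var mutex = new Mutex(true, "Global\\MyAppSingleInstance", out createdNew))
    {
        if (!createdNew)
        {
            // Another instance already owns the mutex; bail out or notify it.
            return;
        }
        // Run the application while holding the mutex...
    }
}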

JavaCPP Leptonica : How to clear memory of pixClone handles

Until now, I've always used pixDestroy to clean up PIX objects in my JavaCPP/Leptonica application. However, I recently noticed a weird memory leak issue that I tracked down to a Leptonica function internally returning a pixClone result. I managed to reproduce the issue by using the following simple test:
@Test
public void test() throws InterruptedException {
    String pathImg = "...";
    for (int i = 0; i < 100; i++) {
        PIX img = pixRead(pathImg);
        PIX clone = pixClone(img);
        pixDestroy(clone);
        pixDestroy(img);
    }
    Thread.sleep(10000);
}
When the Thread.sleep is reached, the RAM memory usage in Windows task manager (not the heap size) has increased to about 1GB and is not released until the sleep ends and the test finishes.
Looking at the docs of pixClone, we see it actually creates a handle to the existing PIX:
Notes:
A "clone" is simply a handle (ptr) to an existing pix. It is implemented because (a) images can be large and hence expensive to copy, and (b) extra handles to a data structure need to be made with a simple policy to avoid both double frees and memory leaks. Pix are reference counted. The side effect of pixClone() is an increase by 1 in the ref count.
The protocol to be used is: (a) Whenever you want a new handle to an existing image, call pixClone(), which just bumps a ref count. (b) Always call pixDestroy() on all handles. This decrements the ref count, nulls the handle, and only destroys the pix when pixDestroy() has been called on all handles.
If I understand this correctly, I am indeed calling pixDestroy on all handles, so the ref count should reach zero and thus the PIX should have been destroyed. Clearly, this is not the case though. Can someone tell me what I'm doing wrong? Thanks in advance!
As an optimization for the common case when a function returns a pointer it receives as argument, JavaCPP also returns the same object to the JVM. This is what is happening with pixClone(). It simply returns the pointer that the user passes as argument, and thus both img and clone end up referencing the same object in Java.
Now, when pixDestroy() gets called on the first reference img, Leptonica helpfully resets its address to 0, but we've now lost the address, and the second call to pixDestroy() receives that null pointer, resulting in a no-op and a memory leak.
One easy way to avoid this issue is by explicitly creating a new PIX reference after each call to pixClone(), for example, in this case:
PIX clone = new PIX(pixClone(img));

FreeMarker ?has_content on Iterator

How does FreeMarker implement .iterator()?has_content for an Iterator?
Does it attempt to get the first item to decide whether to render, and does it keep it for the iteration? Or does it start another iteration?
I have found this:
static boolean isEmpty(TemplateModel model) throws TemplateModelException
{
    if (model instanceof BeanModel) {
        ...
    } else if (model instanceof TemplateCollectionModel) {
        return !((TemplateCollectionModel) model).iterator().hasNext();
        ...
    }
}
But I am not sure what FreeMarker wraps the Iterator type to. Is it TemplateCollectionModel?
It doesn't get the first item; it just calls hasNext(). And yes, with the default ObjectWrapper (see the object_wrapper configuration setting) it will be treated as a TemplateCollectionModel, which can only be listed once, though. But prior to 2.3.24 (unreleased as I write this), there are some glitches to look out for (see below). Also, if you are using pure BeansWrapper instead of the default ObjectWrapper, there's a glitch (see below too).
Glitch (2.3.23 and earlier): if you are using the default ObjectWrapper (you should), and the Iterator is not fetched again for the subsequent listing, that is, the same already-wrapped value is reused, then it will freak out, telling you that an Iterator can be listed only once.
Glitch 2, with pure BeansWrapper: it will just always say that it has content, even if the Iterator is empty. That's also fixed in 2.3.24; however, you need to create the BeansWrapper itself (i.e., not (just) the Configuration) with 2.3.24 incompatibleImprovements for the fix to be active.
Note that <#list ...>...<#else>...</#list> has been and is working in all cases (even before 2.3.24).
Last but not least, thanks for bringing this topic to my attention. I have fixed these in 2.3.24 because of it. (A nightly build can be made from here: https://github.com/apache/incubator-freemarker/tree/2.3-gae)

Nested transactions in LINQ to SQL

I need help with implementing quite complex business logic that operates on many tables and executes quite a few SQL commands. However, I want to be sure that the data will not be left in an inconsistent state, and so far I don't see a solution that would not require nested transactions. I wrote some simple pseudo-code which illustrates a scenario similar to what I want to accomplish:
Dictionary<int, bool> opSucceeded = new Dictionary<int, bool>();
for (int i = 0; i < 10; i++)
{
    try
    {
        // this operation must be atomic
        Operation(dbContext, i);
        // commit (?)
        opSucceeded[i] = true;
    }
    catch
    {
        // ignore
    }
}
try
{
    // this operation must know which Operation(i) has succeeded;
    // it also must be atomic
    FinalOperation(dbContext, opSucceeded);
    // commit all
}
catch
{
    // rollback FinalOperation and Operation(i) where opSucceeded[i] == true
}
The biggest problem for me is: how to ensure that if the FinalOperation fails, all operations Operation(i) which succeeded are rolled back? Note that I also would like to be able to ignore failures of single Operation(i).
Is it possible to achieve this by using nested TransactionScope objects and if not - how would you approach such problem?
If I am following your question, you want to have a series of operations against the database, and you capture enough information to determine whether each operation succeeds or fails (the dictionary in your simplified code).
From there, you have a final operation that must roll back all of the successful operations from earlier if it fails itself.
It would seem this is exactly the type of case that a simple transaction is for. There is no need to keep track of the success or failure of the child/early operations as long as failure of the final operation rolls the entire transaction back (here assuming that FinalOperation isn't using that information for other reasons).
Simply start a transaction before you enter the block described, and commit or rollback the entire thing after you know the status of your FinalOperation. There is no need to nest the child operations as far as I can see from your current description.
Perhaps I am missing something? (Note: if you wanted to RETAIN the earlier/child operations, that would be something different entirely... but a failure of the final op rolling the whole package of operations back makes the simple transaction usable.)
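To make that concrete, here is a hedged sketch (not from the original answer) of the single-transaction approach using System.Transactions. Operation, FinalOperation, and dbContext are the question's placeholders, and whether an individual statement failure leaves the underlying connection/transaction usable depends on the database and provider:
// Requires: using System.Collections.Generic; using System.Transactions;
using (var scope = new TransactionScope())
{
    var opSucceeded = new Dictionary<int, bool>();
    for (int i = 0; i < 10; i++)
    {
        try
        {
            Operation(dbContext, i);   // placeholder from the question
            opSucceeded[i] = true;
        }
        catch
        {
            // Ignore individual failures, as in the original pseudo-code.
        }
    }

    // If FinalOperation throws, scope.Complete() is never called and the
    // whole transaction rolls back when the scope is disposed.
    FinalOperation(dbContext, opSucceeded);
    scope.Complete();
}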