Configuring the Lucene IndexWriter, controlling segment formation (setRAMBufferSizeMB)

How should the setRAMBufferSizeMB parameter be set? Does it depend on the RAM size of the machine, on the size of the data that needs to be indexed, or on some other parameter? Could someone please suggest an approach for deciding the value of setRAMBufferSizeMB?

So, here is what the Lucene javadoc says about this parameter:
Determines the amount of RAM that may be used for buffering added
documents and deletions before they are flushed to the Directory.
Generally for faster indexing performance it's best to flush by RAM
usage instead of document count and use as large a RAM buffer as you
can. When this is set, the writer will flush whenever buffered
documents and deletions use this much RAM.
The maximum RAM limit is inherently determined by the JVMs available
memory. Yet, an IndexWriter session can consume a significantly larger
amount of memory than the given RAM limit since this limit is just an
indicator when to flush memory resident documents to the Directory.
Flushes are likely to happen concurrently while other threads are adding
documents to the writer. For application stability the available
memory in the JVM should be significantly larger than the RAM buffer
used for indexing.
By default, Lucene uses 16 MB for this parameter (which suggests to me that you don't need a very large value to get good indexing speed). I would recommend tuning this parameter by setting it to, say, 500 MB and checking how well your system behaves. If you get crashes, try a smaller value such as 200 MB, and so on, until your system is stable.
Yes, as stated in the javadoc, this parameter depends on the JVM heap, but for Python I think it could allocate memory without any limit.
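For reference, here is a minimal sketch of setting the value through IndexWriterConfig with a recent Java Lucene API (the index path, analyzer, and the 256 MB figure are illustrative only, not recommendations):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class RamBufferExample {
    public static void main(String[] args) throws Exception {
        // Flush whenever buffered documents/deletions use roughly 256 MB of RAM
        // (illustrative value, not a recommendation).
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setRAMBufferSizeMB(256.0);

        // The JVM heap (-Xmx) should be comfortably larger than this buffer,
        // as the javadoc quoted above warns.
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), config)) {
            // writer.addDocument(...) calls go here; flushes happen
            // automatically once the RAM buffer threshold is crossed.
        }
    }
}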

Related

Transferring memory from GPU to CPU with Vulkan and vkInvalidateMappedMemoryRanges synchronization?

In Vulkan, when I want to transfer some memory from the GPU back to the CPU, I think the most efficient way to do this is to write the data into memory which has the flags VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT.
Question #1: Is that assumption correct?
(Full list of available memory property flags can be found in Vulkan's documentation of VkMemoryPropertyFlagBits)
In order to get the latest data, I have to invalidate the memory using vkInvalidateMappedMemoryRanges, right?
Question #2: What is happening under the hood during vkInvalidateMappedMemoryRanges? Is this just a memcpy from some internal cache or can this be a longer procedure?
Question #3: If this could take longer (i.e. it is not a simple memcpy), then I probably should have some possibility to synchronize with the completion of it, right? However, vkInvalidateMappedMemoryRanges does not offer any synchronization parameters. Actually, my question is: IF I have to synchronize it, HOW do I synchronize it?
Question #1: Is that assumption correct?
Probably not, but it depends on your platform whether it supports the alternative. For GPU->CPU transfers there are really three options:
1. HOST_VISIBLE | HOST_COHERENT
This type is visible to the host and guaranteed to be coherent, but not cached on the host. CPU reads will be very slow but that might be OK if you are only reading back a small amount of data (and might be cheaper than issuing vkInvalidateMappedMemoryRanges(), and there is little point streaming data into the CPU cache if you never expect to touch it again on the CPU).
2. HOST_VISIBLE | HOST_CACHED
This type is visible to the host and cached, but not guaranteed to be coherent (the CPU and GPU might see different things at the same address if you don't manually enforce coherency). For this type of memory you must use vkInvalidateMappedMemoryRanges() after GPU writes and before CPU reads (or vkFlushMappedMemoryRanges() for the other direction) to ensure that one processor can see what the other wrote, or you might read stale data (a sketch of this call pattern appears after this list).
Data access will be fast once it is in the cache, and you can benefit from CPU-side data fetch tricks such as explicit preloads and cache prefetching, but you will pay an overhead for the invalidate operation.
3. HOST_VISIBLE | HOST_CACHED | HOST_COHERENT
Finally, you have the host cached AND coherent memory type, which gives you something like the best of both worlds if you have high-bandwidth reads to make on the CPU. Hardware provides the coherency implementation automatically, so there is no need to invalidate, BUT it's not guaranteed to be available on all platforms. For bulk data reads on the CPU I would expect this to be the most efficient type in cases where it is available.
It's worth noting that there is no "best" memory setting for all allocations. Do not use host-cached or host-coherent memory for things you never expect to transfer back to the CPU (memory coherency isn't free in terms of power or memory performance).
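To make the option-2 (HOST_VISIBLE | HOST_CACHED) pattern concrete, here is a rough sketch of the invalidate-before-read call using the LWJGL Java bindings; the device, the memory handle, and the exact binding signatures are assumptions to verify against your binding's documentation, not something from the original answer:

import org.lwjgl.system.MemoryStack;
import org.lwjgl.vulkan.VkDevice;
import org.lwjgl.vulkan.VkMappedMemoryRange;
import static org.lwjgl.vulkan.VK10.*;

// Call after the GPU write has completed (fence already waited on) and
// before the CPU reads the mapped pointer.
void invalidateForRead(VkDevice device, long deviceMemoryHandle) {
    try (MemoryStack stack = MemoryStack.stackPush()) {
        VkMappedMemoryRange.Buffer range = VkMappedMemoryRange.calloc(1, stack)
                .sType(VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE)
                .memory(deviceMemoryHandle) // placeholder VkDeviceMemory handle
                .offset(0)
                .size(VK_WHOLE_SIZE);       // whole allocation, for simplicity
        vkInvalidateMappedMemoryRanges(device, range);
        // Only now is it safe for the CPU to read the mapped memory.
    }
}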
Question #2: What is happening under the hood during vkInvalidateMappedMemoryRanges? Is this just a memcpy from some internal cache or can this be a longer procedure?
In the case where you have non-coherent memory, it does whatever is needed to make the mapped ranges coherent. Typically this means invalidating (discarding) cache lines in the CPU cache which may contain stale copies of the data, ensuring that subsequent reads by the CPU see the version that the GPU actually wrote.
Question #3: If this could take longer (i.e. it is not a simple memcpy), then I probably should have some possibility to synchronize with the completion of it, right?
No. Invalidation is a CPU-side operation, so it takes CPU time to complete and the CPU is busy while the operation is completing. In general you can avoid the need to do it at all by using coherent memory though.

Is redis using cache?

I restarted my Redis server after 120 days.
Before the restart, memory usage was 29.5GB.
After the restart, memory usage was 27.5GB.
So where does the 2GB reduction come from?
Is it memory freed in RAM, as described in this article? https://redis.io/topics/memory-optimization
Redis will not always free up (return) memory to the OS when keys are
removed. This is not something special about Redis, but it is how most
malloc() implementations work. For example if you fill an instance
with 5GB worth of data, and then remove the equivalent of 2GB of data,
the Resident Set Size (also known as the RSS, which is the number of
memory pages consumed by the process) will probably still be around
5GB, even if Redis will claim that the user memory is around 3GB. This
happens because the underlying allocator can't easily release the
memory. For example often most of the removed keys were allocated in
the same pages as the other keys that still exist. The previous point
means that you need to provision memory based on your peak memory
usage. If your workload from time to time requires 10GB, even if most
of the times 5GB could do, you need to provision for 10GB.
However allocators are smart and are able to reuse free chunks of
memory, so after you freed 2GB of your 5GB data set, when you start
adding more keys again, you'll see the RSS (Resident Set Size) to stay
steady and don't grow more, as you add up to 2GB of additional keys.
The allocator is basically trying to reuse the 2GB of memory
previously (logically) freed.
Because of all this, the fragmentation ratio is not reliable when you
had a memory usage that at peak is much larger than the currently used
memory. The fragmentation is calculated as the physical memory actually
used (the RSS value) divided by the amount of memory currently in use
(as the sum of all the allocations performed by Redis). Because the RSS
reflects the peak memory, when the (virtually) used memory is low since
a lot of keys / values were freed, but the RSS is high, the ratio
RSS / mem_used will be very high.
Or is it freed memory from caches that my Redis server was using?
Is Redis using a cache? A cache of a cache?
Thanks!

JVM Option MaxMetaspaceSize - What is Optimal value to set?

I have an application with around 8GB of max heap memory set; it does a lot of message processing. When I take jmap -heap for my application, it shows me a very large MaxMetaspaceSize, something like below,
MaxMetaspaceSize = 17592186044415 MB
If I have to put a limit on MaxMetaspaceSize, what would be an optimal value for it?
Note: I have read about the pros and cons of putting a limit on MaxMetaspaceSize.
Measure how much metaspace your program needs once it has settled into a steady state. Add some safety margin, then configure that as your max.
Should your program change behavior, e.g. load additional classes without unloading old ones, you'll be notified about that through OOMEs.
it shows me very huge MaxMetaspaceSize memory
Note that that is just the upper limit enforced by the JVM, not the actual size.
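As a rough illustration of that workflow (the process id, the sampling interval, and the 512m figure are placeholders, not recommendations), you could watch metaspace usage with jstat while the application is running at a steady state, and then restart it with an explicit cap plus some margin:

# sample GC and metaspace stats every 10 seconds; MC/MU are metaspace capacity/used in KB
jstat -gc <pid> 10000
# once the steady-state footprint is known, cap metaspace with some headroom
java -Xmx8g -XX:MaxMetaspaceSize=512m -jar my-app.jar

If the cap is ever exceeded you will see java.lang.OutOfMemoryError: Metaspace, which is the notification mentioned above.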

Is there any advantage in setting Xms and Xmx to the same value?

Usually I set -Xms512m and -Xmx1g so that when the JVM starts it allocates 512MB and gradually increases the heap to 1GB as necessary. But I see these values set to the same value, say 1g, in a dedicated server instance. Is there any advantage to having both set to the same value?
Well, there are a couple of things.
The program will start with the -Xms value, and if that value is too small, GC will eventually be forced to occur more frequently.
Once the program's heap reaches the -Xms size, the JVM requests additional memory from the OS and eventually grows up to -Xmx; that takes additional time and can lead to performance issues. You might as well set it to that value at the beginning and avoid having the JVM request additional memory.
It is very nicely answered here - https://developer.jboss.org/thread/149559?_sscc=t
From Oracle Java SE 8 docs:
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html
By default, the virtual machine grows or shrinks the heap at each
collection to try to keep the proportion of free space to live objects
at each collection within a specific range. This target range is set
as a percentage by the parameters -XX:MinHeapFreeRatio=<minimum> and
-XX:MaxHeapFreeRatio=<maximum>, and the total size is bounded below by -Xms<min> and above by -Xmx<max>. Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing
decision from the virtual machine. However, the virtual machine is
then unable to compensate if you make a poor choice.
If the values of -Xms and -Xmx are the same, the JVM will not have to adjust the heap size, which means less work for the JVM and more time for your application. But if the chosen value is a poor choice for -Xms, some of the allocated memory will never be used because the heap will never shrink, and if it is a poor choice for -Xmx you will get an OutOfMemoryError.
AFAIK, one more reason is that expansion of the heap is a stop-the-world event; setting both to the same value will prevent that.
There are some advantages.
If you know the size is going to grow to the maximum, e.g. in a benchmark, you may as well start with the size you know you need.
You can get better performance by giving the program more memory than it might naturally give itself. YMMV.
In general, I would make Xms a value I am confident it will use, and then double this for Xmx as headroom for future use cases or situations we haven't tested for, i.e. a size we don't expect but that it might use.
In short, the maximum is the point you would rather the program fail than use any more.
The application will suffer frequent GCs with a lower -Xms value.
Asking the OS for more memory each time the heap grows consumes time.
Above all, if your application is performance critical then you certainly want to avoid memory pages being swapped out to/from disk, as this will cause GC to consume more time. To avoid this, memory can be locked. But if Xms and Xmx are not the same, memory allocated after the initial allocation will not be locked.
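As a concrete illustration (the 4g figure and the jar name are placeholders, not tuned values), a dedicated server instance is typically started with both flags set to the same value:

java -Xms4g -Xmx4g -jar my-app.jar

With -Xms equal to -Xmx the heap starts at its maximum size, so the JVM never has to grow it later and the resizing behaviour described in the Oracle quote above simply does not come into play.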

Memory usage keep growing while writing the Lucene.Net index

I am opening this discussion because, googling about Lucene.Net usage, I have not found anything really useful.
The issue is simple: I am experiencing a problem building and updating the Lucene.Net index. In particular, its memory usage keeps growing even though I fix SetRAMBufferSizeMB to 256, SetMergeFactor to 100 and SetMaxMergeDocs to 100000. Moreover, I carefully use the Close() and Commit() methods every time the index is used.
To make Lucene.Net work for my data I started from this tutorial: http://www.lucenetutorial.com/lucene-in-5-minutes.html
It seems that for 10^5 and 10^6 documents about 1.8GB of RAM is necessary. So why do I have to set the SetRAMBufferSizeMB parameter if the actual RAM usage is 7 times larger? Does anyone really know how to keep the memory usage bounded?
Moreover, I observed that to deal with 10^5 or 10^6 documents it is necessary to compile Lucene.Net for an x64 platform. Indeed, if I compile the code for an x86 platform, indexing crashes systematically on reaching about 1.2GB of RAM.
Is anyone able to index the same number of documents (or even more) using less RAM? With which hardware and software setup? My environment configuration is the following:
- os := win7 32/64 bits.
- sw := .Net framework 4.0
- hw := a 12 core Xeon workstation with 6GB of RAM.
- Lucene.Net rel.: 2.9.4g (current stable).
- Lucene.Net directory type: FSDirectory (the index is written into the disk).
OK, I tested the code using your advice on re-using Document/Field instances; however, the code performs exactly the same in terms of memory usage.
Here I post a few debugging lines for some parameters I tracked during the indexing process of 1,000,000 documents.
DEBUG - BuildIndex – IndexWriter - RamSizeInBytes 424960B; index process dimension 1164328960B. 4% of the indexing process.
DEBUG - BuildIndex – IndexWriter - RamSizeInBytes 457728B; index process dimension 1282666496B. 5% of the indexing process.
DEBUG - BuildIndex – IndexWriter - RamSizeInBytes 457728B; index process dimension 1477861376B. 6% of the indexing process.
The index process dimension is obtained from the working set of the process (see the debugging code posted below).
It is easy to observe how fast the process grows in RAM (~1.5GB at 6% of the indexing process) even though the RAM buffer used by the IndexWriter is more or less unchanged. Therefore, the question is: is it possible to explicitly limit the RAM usage of the indexing process? I do not care if performance drops during the searching phase, or if I have to wait a while for a complete index, but I need to be sure that the indexing process does not hit an OOM or a stack overflow error when indexing a large number of documents. How can I do that if it is impossible to limit the memory usage?
For completeness, I post the code used for the debugging:
// get the current process
Process currentProcess = System.Diagnostics.Process.GetCurrentProcess();
// get the RAM used by the index writer's buffer
long totalBytesOfIndex = writer.RamSizeInBytes();
// get the physical mem usage
long totalBytesOfMemoryUsed = currentProcess.WorkingSet64;
Finally, I found the bug. It is in the ItalianAnalyzer (the analyzer for the Italian language), which was built using the Luca Gentili contribution (http://snowball.tartarus.org/algorithms/italian/stemmer.html). Inside the ItalianAnalyzer class, a file containing the stop words was opened several times and never closed after use. This was the reason for the OOM problem in my case.
With this bug solved, Lucene.Net is lightning fast both at building the index and at searching.
SetRAMBufferSizeMB is just one of the ways to determine when to flush the IndexWriter to disk. It will flush segment data when XXX MB have been written to memory and are ready to be flushed to disk.
There are lots of other objects in Lucene that will also use memory, and have nothing to do with the RamBuffer.
Usually, the first thing to try when you hit an OOM while indexing is to re-use Document/Field instances. If you multithread the indexing, make sure you only reuse them on the same thread. It has happened to me to run OOM because of exactly that, when the underlying IO is blazing fast and the .NET garbage collector just can't keep up with all the small objects created.
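To illustrate the reuse idea, here is a rough sketch written against a recent Java Lucene API (Lucene.Net mirrors this API with Pascal-case names; the exact setter names are version-dependent, and the Record type, field names and surrounding writer are placeholders, not from the question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Create the Document and Field objects once, outside the loop...
Document doc = new Document();
Field idField = new StringField("id", "", Field.Store.YES);
Field bodyField = new TextField("body", "", Field.Store.NO);
doc.add(idField);
doc.add(bodyField);

// ...then only change their values per record, instead of allocating fresh
// Document/Field instances for each of the 10^6 documents ('writer' and
// 'records' are assumed to exist in the surrounding indexing code).
for (Record r : records) {
    idField.setStringValue(r.id);
    bodyField.setStringValue(r.body);
    writer.addDocument(doc);
}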