How can I set maxChunkSize for MMapDirectoryFactory in Solr?

I was reading about MMapDirectoryFactory and how to set it as the DirectoryFactory for my core's indexes in solrconfig.xml:
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory">
  <bool name="preload">true</bool>
</directoryFactory>
I cannot figure out, and cannot find any example of, how to set maxChunkSize.
References:
https://solr.apache.org/guide/8_5/datadir-and-directoryfactory-in-solrconfig.html
https://solr.apache.org/docs/8_5_0/solr-core/org/apache/solr/core/MMapDirectoryFactory.html

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.MMapDirectoryFactory}">
  <int name="maxChunkSize">1073741824</int>
  <bool name="preload">true</bool>
</directoryFactory>
I believe this is what you are looking for.
Note that there is validation in place for the chunk size:
public MMapDirectory(Path path, LockFactory lockFactory, int maxChunkSize) throws IOException {
  ...
  this.chunkSizePower = 31 - Integer.numberOfLeadingZeros(maxChunkSize);
  assert this.chunkSizePower >= 0 && this.chunkSizePower <= 30;
}
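As a quick sanity check of how that validation behaves (plain Java arithmetic, nothing Solr-specific; the values are illustrative):
int maxChunkSize = 1 << 30;  // 1073741824, the value used in the config above
int chunkSizePower = 31 - Integer.numberOfLeadingZeros(maxChunkSize);  // 30, so the assert passes
// A maxChunkSize of 0 or a negative value would put chunkSizePower outside 0..30 and trip the assert.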
That said, per the Javadoc you should not normally need to change this parameter on modern production systems, since they are usually 64-bit already:
Create a new MMapDirectory for the named location, specifying the maximum chunk size used for memory mapping. The directory is created at the named location if it does not yet exist.
Especially on 32 bit platform, the address space can be very fragmented, so large index files cannot be mapped. Using a lower chunk size makes the directory implementation a little bit slower (as the correct chunk may be resolved on lots of seeks) but the chance is higher that mmap does not fail. On 64 bit Java platforms, this parameter should always be 1 << 30, as the address space is big enough.

Related

How do I know in advance that the buffer size is enough in nanopb?

I'm trying to use nanopb, following the example:
https://github.com/nanopb/nanopb/blob/master/examples/simple/simple.c
The buffer size is initialized to 128:
uint8_t buffer[128];
My question is: how do I know (in advance) that this 128-byte buffer is enough to transmit my message? How do I decide a proper buffer size (big enough, but not wastefully over-sized) before writing the code?
It probably looks like a noob question :), but thanks for any quick suggestions.
When possible, nanopb adds a define in the generated .pb.h file that has the maximum encoded size of a message. In the file examples/simple/simple.pb.h you'll find:
/* Maximum encoded size of messages (where known) */
#define SimpleMessage_size 11
You could then declare uint8_t buffer[SimpleMessage_size];.
This define is only available if all repeated and string fields have the (nanopb).max_count and (nanopb).max_size options specified.
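For illustration, those options are usually given in a companion .options file; the message and field names below are made up, not taken from the simple example:
# myproto.options
MyMessage.name   max_size:40
MyMessage.values max_count:8
The same options can also be written inline in the .proto file as field options, e.g. [(nanopb).max_size = 40].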
For many practical purposes, you can pick a buffer size that you estimate will be large enough and handle the error conditions. It is also possible to use pb_get_encoded_size() to calculate the encoded size and dynamically allocate storage, but in general that is not a great solution in embedded applications. When total system memory is limited, it is often better to have a constant-sized buffer that you can test with, instead of having the available amount of dynamic memory vary at runtime.
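A minimal sketch of both approaches, assuming nanopb 0.4.x and the SimpleMessage type from the linked example (error handling trimmed):
#include <pb_encode.h>
#include "simple.pb.h"

void size_demo(void) {
    SimpleMessage msg = SimpleMessage_init_zero;
    msg.lucky_number = 42;

    /* Compile-time upper bound taken from the generated header. */
    uint8_t buffer[SimpleMessage_size];
    (void)buffer;

    /* Exact encoded size of this particular message, computed at runtime. */
    size_t needed;
    if (pb_get_encoded_size(&needed, SimpleMessage_fields, &msg)) {
        /* needed is guaranteed to be <= SimpleMessage_size */
    }
}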

How to use VkPipelineCache?

If I understand it correctly, I'm supposed to create an empty VkPipelineCache object, pass it into vkCreateGraphicsPipelines, and data will be written into it. I can then reuse it with other pipelines I'm creating, or save it to a file and use it on the next run.
I've tried following the LunarG example to extract the info:
uint32_t headerLength = pData[0];
uint32_t cacheHeaderVersion = pData[1];
uint32_t vendorID = pData[2];
uint32_t deviceID = pData[3];
But I always get headerLength is 32 and the rest 0. Looking at the spec (https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html Table 9.1), the cacheHeaderVersion should always be 1, as the only available cache header version is VK_PIPELINE_CACHE_HEADER_VERSION_ONE = 1.
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
A Vulkan pipeline cache is an opaque object that's only meaningful to the driver. There are very few operations that you're supposed to use on it.
Creating a pipeline cache, optionally with a block of opaque binary data that was saved from an earlier run
Getting the opaque binary data from an existing pipeline cache, typically to serialize to disk before exiting your application
Destroying a pipeline cache as part of the proper shutdown process.
The idea is that the driver can use the cache to speed up creation of pipelines within your program, and also to speed up pipeline creation on subsequent runs of your application.
You should not be attempting to interpret the cache data returned from vkGetPipelineCacheData at all. The only purpose for that data is to be passed into a later call to vkCreatePipelineCache.
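For reference, a rough sketch of that round trip (file I/O and error checks omitted; it assumes an already created VkDevice device and VkPipelineCache pipelineCache, and the variable names are illustrative):
/* At shutdown: fetch the opaque blob with the usual two-call pattern. */
size_t blobSize = 0;
vkGetPipelineCacheData(device, pipelineCache, &blobSize, NULL);
void *blob = malloc(blobSize);
vkGetPipelineCacheData(device, pipelineCache, &blobSize, blob);
/* ...write blobSize bytes from blob to disk... */

/* On the next run: hand the blob back unmodified. */
VkPipelineCacheCreateInfo info = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
    .initialDataSize = blobSize,  /* use 0 and NULL when no saved data exists yet */
    .pInitialData = blob,
};
vkCreatePipelineCache(device, &info, NULL, &pipelineCache);
Drivers are expected to validate the blob (vendor/device IDs, cache UUID in the header) themselves and simply ignore data they cannot use, so there is nothing for application code to interpret.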
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
Drivers must implement vkCreatePipelineCache, vkGetPipelineCacheData, etc. But they don't actually have to support caching. So if you're working with a driver that doesn't have anything it can cache, or hasn't done the work to support caching, then you'd naturally get back an empty cache (other than the header).

OutOfMemory in Apache Lucene

I'm getting an OutOfMemory error in Apache Lucene.
Here is the problem code:
DirectoryReader oldReader = directoryReader;
DirectoryReader newReader = DirectoryReader.openIfChanged(directoryReader);
if ((newReader != null) & (oldReader != newReader)) {
  directoryReader = newReader;
}
Here is the log:
Caused by: java.lang.OutOfMemoryError: Java Heap Space
at java.lang.Class.getMethodImpl(Native Method)
at java.lang.Class.getMethod(Class.java:917)
at org.apache.lucene.store.MMapDirectory$MMapIndexInput$1.run(MMapDirectory.java:244)
at org.apache.lucene.store.MMapDirectory$MMapIndexInput$1.run(MMapDirectory.java:241)
at java.security.AccessController.doPrivileged(AccessController.java:327)
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.freeBuffer(MMapDirectory.java:241)
at org.apache.lucene.store.ByteBufferIndexInput.close(ByteBufferIndexInput.java:295)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:788)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:694)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:400)
at org.apache.lucene.index.StandardDirectoryReader.isCurrent(StandardDirectoryReader.java:349)
at org.apache.lucene.index.StandardDirectoryReader.doOpenNoWriter(StandardDirectoryReader.java:303)
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266)
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:254)
at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
Any idea what the problem could be?
Since you're using MMapDirectory, you should be aware that:
NOTE: memory mapping uses up a portion of the virtual memory address space in your process equal to the size of the file being mapped. Before using this class, be sure you have plenty of virtual address space, e.g. by using a 64 bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the address space. On 32 bit platforms also consult MMapDirectory(Path, LockFactory, int) if you have problems with mmap failing because of fragmented address space. If you get an OutOfMemoryException, it is recommended to reduce the chunk size, until it works.
MMapDirectory uses memory-mapped IO when reading. This is a good choice if you have plenty of virtual memory relative to your index size, e.g. if you are running on a 64 bit JRE, or you are running on a 32 bit JRE but your index sizes are small enough to fit into the virtual memory space.
So consider tuning your chunk size, or choose another Directory implementation such as NIOFSDirectory or SimpleFSDirectory.
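A rough sketch of both options, assuming a Lucene 5.x-8.x style API where these constructors take a java.nio.file.Path (on 4.x the first argument is a java.io.File); the index path is illustrative:
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.store.NativeFSLockFactory;

// Option 1: keep mmap, but map the index in smaller chunks (e.g. 256 MB)
// so a fragmented address space is less likely to make mmap fail.
Directory dir = new MMapDirectory(Paths.get("/path/to/index"),
        NativeFSLockFactory.INSTANCE, 256 * 1024 * 1024);

// Option 2: avoid memory mapping entirely.
// Directory dir = new NIOFSDirectory(Paths.get("/path/to/index"));

DirectoryReader reader = DirectoryReader.open(dir);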

File reading and checksums in Go. Difference between methods

Recently I've been creating checksums for files in Go. My code works with both small and big files. I tried two methods: the first uses ioutil.ReadFile("filename") and the second works with os.Open("filename").
Examples:
The first function uses io/ioutil and works for small files, but when I try to copy a big file my RAM gets blasted: for a 1.5GB ISO it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err)                                 // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
It's even worse when I want to create a checksum with crypto/sha512 and io/ioutil.
It never finishes and aborts because it runs out of memory.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is a big ~1.5GB file
    omg(err)               // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My question:
Why is the ioutil.ReadFile() function not working right? A 1.5GB file should not fill my 16GB of RAM, and I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it from reading the Go docs and examples.
Having usable code is nice, but understanding why it works matters even more.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
file, _ := ioutil.ReadFile(iso)
h := sha512.New()
fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB ISO. As jnml pointed out, it continuously makes bigger and bigger buffers to fill it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you then take that file, now with the hash appended, and print it with %x. Fmt actually pre-computes the output using the same kind of method jnml pointed out for ioutil.ReadAll, so it keeps allocating bigger and bigger buffers to store the hex representation of your file. Since each hex character encodes only 4 bits, the hex string is twice the length of the data, which means no less than a 3GB buffer for that and no greater than 3.75GB.
This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your space.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the same number of allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used sequentially even though they don't depend on each other.
ioutil.ReadFile is working right; it's your fault for abusing system resources by using that function on files you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance are going to be small, like configuration files, most source code files, etc. (Actually it optimizes for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus to allocate more than one big buffer for your data in the process of reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files; see bufio.NewReader.
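A minimal sketch of that combination (the file path comes from the first command-line argument; a bufio.Reader feeds the hash in fixed-size chunks, so memory use stays flat regardless of file size):
package main

import (
    "bufio"
    "crypto/sha512"
    "fmt"
    "io"
    "log"
    "os"
)

func bufferedHash(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    h := sha512.New()
    // io.Copy pulls the file through the buffered reader chunk by chunk,
    // so only a small buffer is ever resident in memory.
    if _, err := io.Copy(h, bufio.NewReader(f)); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    sum, err := bufferedHash(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(sum)
}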

What's the PHP APC cache's apc.shm_strings_buffer setting for?

I'm trying to understand the apc.shm_strings_buffer setting in apc.ini. After restarting PHP, the pie chart in the APC admin shows 8MB of cache is already used, even though there are no cached entries (except for apc.php, of course). I've found this relates to the apc.shm_strings_buffer setting.
Can someone help me understand what the setting means? The config file notes that this is the "shared memory size reserved for strings, with M/G suffixe", but I don't understand what that implies.
I'm using APC with PHP-FPM.
The easy part to explain is "with M/G suffixe", which means that if you set it to 8M, then 8 megabytes would be allocated, while 1G would allocate 1 gigabyte of memory.
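For example (illustrative value), in apc.ini:
apc.shm_strings_buffer = 8M   ; reserve 8 megabytes up front for APC's interned-strings buffer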
The more difficult bit to explain is that it's a cache for storing strings that are used internally by APC when it's compiling and caching opcode.
The config value was introduced in this change, and the bulk of the change was to add apc_string.c to the APC project. The main function defined in that C file is apc_new_interned_string, which is then used (for example via apc_string_pmemcpy in apc_compile.c) by the rest of the APC module to store strings.
For example, in apc_compile.c:
/* private members are stored inside property_info as a mangled
* string of the form:
* \0<classname>\0<membername>\0
*/
CHECK((dst->name = apc_string_pmemcpy((char *)src->name, src->name_length+1, pool TSRMLS_CC)));
When APC goes to store a string, the function apc_new_interned_string checks whether that string is already saved in memory by hashing it, and if it is already stored, it returns the previous instance of the stored string.
Only if that string is not already stored in the cache does a new piece of memory get allocated to store the string.
If you're running PHP with PHP-FPM, I'm 90% confident that the cache of stored strings is shared amongst all the workers in a single pool, but am still double-checking that.
The whole buffer for shared strings is allocated when PHP starts up; it is not allocated dynamically. So it's to be expected that APC shows the 8MB used for the string cache, even though hardly any strings have actually been cached yet.
Edit: although this explains what the setting does, I have no idea how to see how much of the shared-string buffer is actually in use, so there's no good way of knowing what it should be set to.