Linux OOM killer - find the process that fragments memory

I am currently facing major issues with the OOM killer after upgrading from Debian 7 to Debian 8 on ARM (Cubietruck).
I have already read a lot about the OOM mechanism and its causes, but I still miss an answer to:
How do I find the memory fragmentation caused by a running process?
I have cat-ed /proc/buddyinfo and can watch fragmentation rise until the moment the OOM killer kills a process, but that only covers the system as a whole. Is there a way to get buddyinfo-like information on a per-process basis?
P.S.: I assume some process either creates fragmentation directly (via kmalloc?) or, through its run-time behaviour, triggers actions in the kernel that fragment memory (buffers, control structures or whatever ...)
The cubietruck system has 2GB RAM and 4GB swap.
2017-03-16 09:06:17 cubietruck kernel: [2114250.857191] HighMem free:245388kB min:512kB low:2016kB high:3520kB
  active_anon:200864kB inactive_anon:230728kB active_file:331288kB inactive_file:294800kB unevictable:0kB
  writepending:0kB present:1307648kB managed:1307648kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
  kernel_stack:0kB pagetables:2456kB bounce:0kB free_pcp:92kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 0 0
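For reference, here is a minimal sketch of how the whole-system /proc/buddyinfo view mentioned above could be snapshotted periodically, so that the growth of fragmentation can be lined up with what the processes were doing at the time. C# is used only to match the code elsewhere on this page, and the 30-second interval is an arbitrary choice; a cron job with cat would do the same.

using System;
using System.IO;
using System.Threading;

class BuddyinfoWatch
{
    // Periodically dump /proc/buddyinfo with a timestamp so that rising
    // fragmentation can later be correlated with system/process activity.
    static void Main()
    {
        while (true)
        {
            Console.WriteLine($"--- {DateTime.Now:O} ---");
            foreach (var line in File.ReadLines("/proc/buddyinfo"))
                Console.WriteLine(line);

            Thread.Sleep(TimeSpan.FromSeconds(30));
        }
    }
}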

Related

How does TensorFlow use both shared and dedicated GPU memory on the GPU on Windows 10?

When running a TensorFlow job I sometimes get a non-fatal error that says GPU memory exceeded, and then I see the "Shared memory GPU usage" go up on the Performance Monitor on Windows 10.
How does TensorFlow achieve this? I have looked at the CUDA documentation and have not found a reference to the Dedicated and Shared concepts used in the Performance Monitor. There is a Shared Memory concept in CUDA, but I think it is something on the device, not the RAM I see in the Performance Monitor, which is allocated by the BIOS from CPU RAM.
Note: A similar question was asked but not answered by another poster.
Shared memory in Windows 10 does not refer to the same concept as CUDA shared memory (or local memory in OpenCL); it refers to host-accessible memory allocated for the GPU. For integrated graphics, host and device memory are effectively one and the same and are "shared", since the CPU and GPU sit on the same die and can access the same RAM. For dedicated graphics cards with their own memory, it is separate memory allocated on the host side for use by the GPU.
Shared memory in compute APIs such as GLSL compute shaders or Nvidia CUDA kernels refers to a programmer-managed cache layer (sometimes called "scratchpad memory"), which on Nvidia devices exists per SM, can only be accessed by that single SM, and is usually between 32 kB and 96 kB per SM. Its purpose is to speed up access to frequently used data.
If you see an increase in shared memory used by TensorFlow, you have a dedicated graphics card, and you are experiencing "GPU memory exceeded" errors, it most likely means you are using too much memory on the GPU itself, so it is trying to allocate memory from elsewhere (i.e. system RAM). This can potentially make your program much slower, because bandwidth and latency are much worse for non-device memory on a dedicated graphics card.
I think I figured this out by accident. The "Shared GPU Memory" reported by the Windows 10 Task Manager Performance tab does get used if there are multiple processes hitting the GPU simultaneously. I discovered this by writing a Python program that used multiprocessing to queue up multiple GPU tasks, and I saw the "Shared GPU memory" start filling up. This is the only way I've seen it happen.
So it is only for queueing tasks. Each individual task is still limited to the onboard DRAM minus whatever is permanently allocated to actual graphics processing, which seems to be around 1GB.

Improving performance of Redis setup (degraded after setting vm.overcommit_memory=1)

I need some help diagnosing and tuning the performance of my Redis setup (2 redis-server instances on an Ubuntu 14.04 machine). Note that a write-heavy Django web application shares the VM with Redis. The machine has 8 cores and 25GB RAM.
I recently discovered that background saving was intermittently failing (with a fork() error) even when RAM wasn't exhausted. To remedy this, I applied the setting vm.overcommit_memory=1 (it was previously at the default).
Moreover, vm.swappiness=2 and vm.overcommit_ratio=50. I have also disabled transparent huge pages in my setup via echo never > /sys/kernel/mm/transparent_hugepage/enabled (although I haven't done echo never > /sys/kernel/mm/transparent_hugepage/defrag).
Right after changing the overcommit_memory setting, I noticed that I/O utilization went from 13% to 36% (on average). I/O operations per second doubled, the redis-server CPU consumption has more than doubled, and the memory it consumes has gone up 66%. Consequently, the server response time has gone up substantially. Things escalated abruptly right after applying vm.overcommit_memory=1.
Note that redis-server is the only component showing this escalation - gunicorn, nginx, celery etc. are performing as before. Moreover, Redis has become very spiky.
Lastly, New Relic has started showing me 3 Redis instances instead of 2 (bottommost graph). I think the forked child is counted as the 3rd.
My question is: how can I diagnose and salvage performance here? Being new to server administration, I'm unsure how to proceed. Help me find out what's going on here and how I can fix it.
free -m has the following output (in case needed):
             total       used       free     shared    buffers     cached
Mem:         28136      27912        224        576         68       6778
-/+ buffers/cache:       21064       7071
Swap:            0          0          0
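As an aside, the kernel knobs mentioned above can be read back directly from /proc and /sys to confirm what is actually in effect. A minimal sketch follows (C# is used only to match the code elsewhere on this page; a plain cat works just as well):

using System;
using System.IO;

class ShowVmKnobs
{
    static void Main()
    {
        // Paths of the kernel tunables discussed in the question.
        string[] knobs =
        {
            "/proc/sys/vm/overcommit_memory",
            "/proc/sys/vm/overcommit_ratio",
            "/proc/sys/vm/swappiness",
            "/sys/kernel/mm/transparent_hugepage/enabled",
            "/sys/kernel/mm/transparent_hugepage/defrag",
        };

        foreach (var path in knobs)
        {
            // Each file holds a single line; the THP files mark the
            // currently active value with [brackets].
            Console.WriteLine($"{path}: {File.ReadAllText(path).Trim()}");
        }
    }
}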
As you don't have swap enabled on your system (which might be worth reconsidering if you have SSDs), and your swappiness was set to a low value anyway, you can't blame the problem on increased swapping due to memory contention.
You're caching about 6GB of data in the VFS cache. Under memory contention that cache would have been depleted in favour of process working memory, so I believe it's safe to say memory is not the issue here at all.
It's a shot in the dark, but my guess is that your redis-server is configured to "sync"/"save" too often (search for "appendfsync" in the Redis config file), and that by removing the memory allocation limitation you have allowed it to actually do its job :)
If the data is not super crucial, set appendfsync to "no" and perhaps tweak the save settings so snapshots happen less frequently.
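For illustration only, here is a rough sketch of applying those two changes at runtime with CONFIG SET, using the StackExchange.Redis client (C# is used just to match the code elsewhere on this page; redis-cli or editing redis.conf achieves the same thing, and the save thresholds below are made-up values to tune for your own durability needs):

using System;
using StackExchange.Redis;

class TuneAppendFsync
{
    static void Main()
    {
        // allowAdmin is required for CONFIG SET through StackExchange.Redis.
        var mux = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
        var server = mux.GetServer("localhost", 6379);

        // Let the OS decide when to flush the AOF instead of fsync'ing every
        // second (or on every write): weakest durability, least I/O.
        server.ConfigSet("appendfsync", "no");

        // Thin out RDB snapshots, e.g. only after 10000 changes within
        // 15 minutes (equivalent to `save 900 10000` in redis.conf).
        server.ConfigSet("save", "900 10000");

        // Read the setting back to confirm it took effect.
        Console.WriteLine(server.ConfigGet("appendfsync")[0]);
        mux.Dispose();
    }
}

Keep in mind that appendfsync "no" trades durability for less I/O: on a crash you can lose whatever the OS had not yet flushed to disk.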
BTW, regarding the redis & forked child, I believe you are correct.

Shrinking JVM memory and swap

Virtual Machine:
4CPU
10GB RAM
10GB swap
Java 1.7
-Xms=-Xmx=6144m
Tomcat 7
We observed some very strange behaviour with the JVM: the JVM's resident memory began to shrink and the swap usage shot up to over 50%.
Please see below stats from monitoring tools.
http://i44.tinypic.com/206n6sp.jpg
http://i44.tinypic.com/m99hl0.jpg
Any pointers to help understand this would be appreciated.
Thanks!
Or maybe your Java program was idle and didn't need that memory, and you have high swappiness? In such a situation the OS frees RAM just in case and keeps only the part that is actually being used.
In my opinion that is actually good behaviour: why should you waste RAM on a process that won't use it?
If you run only this one process on the VM, though, it would be quite a good idea to set swappiness to 0 or some other small number - the memory was given to this single process, so swapping it out can safely be disabled.
Thanks for the response. Yes, this is closer to system troubleshooting than to Java, but I thought this was the right forum to raise the topic in case anybody has seen such a phenomenon with the JVM.
Anyway, I had already checked top, and no, there was no process other than Java that was hungry for memory. In fact the second-largest process was using 72MB (RSS).
No, swappiness is not set aggressively on this system; it is at the default of 60. One additional piece of information I missed sharing: we have 4 app servers in a cluster and all of them showed this behaviour at exactly the same time. AFAIK the JVM does not swap itself out, but the OS would, and all of this is what is confusing me.
All these app servers are in production and busy serving requests, so they are not idle. The used heap size was on average 5GB out of the 6GB.
The other interesting thing I found was some failure messages in the VMware logs from the same time, which is what I'm investigating now.

Which takes longer: switching between user and kernel mode, or switching between two processes?

Which takes longer:
switching between user and kernel mode, or switching between two processes?
Please explain the reason too.
EDIT: I do know that whenever there is a context switch, it takes some time for the dispatcher to save the state of the previous process in its PCB and then load the next process from its corresponding PCB. And for switching between user and kernel mode, I know that the mode bit has to be changed. Is that all, or is there more to it?
Switching between processes (given that you actually switch, rather than running them in parallel) takes longer, by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt. Around 2005 (I don't remember the kernel version), after a discussion on the mailing list where someone found that trapping was slower (in absolute terms!) on a high-end Xeon processor than on an earlier Pentium II or III (again, from memory), it was reimplemented using the sysenter CPU instruction (which had actually existed since the Pentium Pro, I think). This goes through the Virtual Dynamic Shared Object (vDSO) page mapped into each process (cat /proc/pid/maps to find it), IIRC.
So, nowadays, a kernel trap is basically just a couple of CPU instructions, hence rather few cycles, compared to the tens or hundreds of thousands spent when using an interrupt (which is really slow on modern CPUs).
A context switch between processes is heavy. It means storing all processor state (registers, etc.) to RAM (at a magic memory location in the user process's address space, actually - guess where!), in practice dirtying all cached memory in the CPU, and reading back the process state for the new process. The new process will (likely) have nothing left in the CPU cache from the last time it ran, so each memory read will be a cache miss and has to be served from RAM. This is rather slow.
When I was at university, I "invented" (well, I came up with the idea, knowing that there is plenty of die area in a CPU, but not enough cooling if it is all constantly powered) a cache that was infinite in size but unpowered when unused (i.e. only used on context switches), and implemented this in Simics. I added support for this magic cache, which I called CARD (Context-switch Active, Run-time Drowsy), to Linux and benchmarked it rather heavily. I found that it could speed up a Linux machine with lots of heavy processes sharing the same core by about 5%. This was with relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
To answer the follow-up question of where in user-space memory the processor state is stored for each process:
At address zero. Yep, the null pointer! You can't read from that page from user-space anyway :) This was back in 2005, but it's probably still the same now, unless the CPU state information has grown larger than a page size, in which case the implementation might have changed.
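To make the relative costs tangible, here is a rough micro-benchmark sketch (C# is used only to match the code elsewhere on this page, and the pipe trick plus the iteration count are arbitrary choices). The first loop is kernel traps only: one thread writes a byte into a pipe and immediately reads it back, so every round trip is two system calls with no switch. The second loop ping-pongs a byte between two threads that block on pipes, so every round trip also forces the scheduler to switch tasks; a switch between two full processes with separate address spaces costs more still, because of the TLB and cache effects described above.

using System;
using System.Diagnostics;
using System.IO.Pipes;
using System.Threading;

class SwitchCost
{
    const int Iterations = 100_000;

    static void Main()
    {
        // 1) Kernel traps only: write one byte and read it straight back on
        //    the same thread. No blocking, so no context switch is forced.
        using (var server = new AnonymousPipeServerStream(PipeDirection.Out))
        using (var reader = new AnonymousPipeClientStream(PipeDirection.In, server.ClientSafePipeHandle))
        {
            var buf = new byte[1];
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Iterations; i++)
            {
                server.WriteByte(1);
                reader.Read(buf, 0, 1);
            }
            sw.Stop();
            Console.WriteLine($"trap-only round trip: {sw.Elapsed.TotalMilliseconds * 1000 / Iterations:F2} us");
        }

        // 2) Kernel traps plus context switches: ping-pong one byte between
        //    two threads over two pipes; each side blocks until the other
        //    writes, so the scheduler must switch back and forth every time.
        using (var ping = new AnonymousPipeServerStream(PipeDirection.Out))
        using (var pingIn = new AnonymousPipeClientStream(PipeDirection.In, ping.ClientSafePipeHandle))
        using (var pong = new AnonymousPipeServerStream(PipeDirection.Out))
        using (var pongIn = new AnonymousPipeClientStream(PipeDirection.In, pong.ClientSafePipeHandle))
        {
            var echo = new Thread(() =>
            {
                var b = new byte[1];
                for (int i = 0; i < Iterations; i++)
                {
                    pingIn.Read(b, 0, 1);   // block until the main thread writes
                    pong.WriteByte(1);      // then wake the main thread up again
                }
            });
            echo.Start();

            var buf = new byte[1];
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Iterations; i++)
            {
                ping.WriteByte(1);
                pongIn.Read(buf, 0, 1);
            }
            sw.Stop();
            echo.Join();
            Console.WriteLine($"ping-pong round trip: {sw.Elapsed.TotalMilliseconds * 1000 / Iterations:F2} us");
        }
    }
}

Expect the second number to come out noticeably larger than the first on most systems, which is exactly the point of the answer above.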

Memory usage keeps growing while writing the Lucene.Net index

I am opening this discussion because, googling about Lucene.Net usage, I have not found anything really useful.
The issue is simple: I am experiencing a problem building and updating a Lucene.Net index. In particular, its memory usage keeps growing even if I fix SetRAMBufferSizeMB to 256, SetMergeFactor to 100 and SetMaxMergeDocs to 100000. Moreover, I carefully call the Close() and Commit() methods every time the index is used.
To make Lucene.Net work on my data I started from this tutorial: http://www.lucenetutorial.com/lucene-in-5-minutes.html
It seems that for 10^5 to 10^6 documents 1.8GB of RAM is necessary. So why do I have to set the SetRAMBufferSizeMB parameter at all if the actual RAM usage is 7 times higher? Does anyone really know how to keep the memory usage bounded?
Moreover, I observed that to deal with 10^5 or 10^6 documents it is necessary to compile Lucene.Net for the x64 platform. Indeed, if I compile the code for the x86 platform, the indexing systematically crashes when it touches about 1.2GB of RAM.
Is anyone able to index the same number of documents (or even more) using less RAM? With which hardware and software setup? My environment configuration is the following:
- os := win7 32/64 bits.
- sw := .Net framework 4.0
- hw := a 12 core Xeon workstation with 6GB of RAM.
- Lucene.Net rel.: 2.9.4g (current stable).
- Lucene.Net directory type: FSDirectory (the index is written into the disk).
OK, I tested the code following your advice on re-using Document/Field instances; however, the code performs exactly the same in terms of memory usage.
Here are a few debugging lines for some of the parameters I tracked while indexing 1000000 documents.
DEBUG - BuildIndex – IndexWriter - RamSizeInBytes 424960B; index process dimension 1164328960B. 4% of the indexing process.
DEBUG - BuildIndex – IndexWriter - RamSizeInBytes 457728B; index process dimension 1282666496B. 5% of the indexing process.
DEBUG - BuildIndex – IndexWriter - RamSizeInBytes 457728B; index process dimension 1477861376B. 6% of the indexing process.
The index process dimension is obtained as shown in the debugging code below.
It is easy to see how fast the process grows in RAM (~1.5GB at 6% of the indexing process) even though the RAM buffer used by the IndexWriter stays more or less unchanged. Therefore, the question is: is it possible to explicitly limit the RAM usage of the indexing process? I do not care if performance drops during the search phase or if I have to wait a while for a complete index, but I need to be sure that the indexing process does not hit an OOM or a stack overflow error when indexing a large number of documents. How can I do that if it is impossible to limit the memory usage?
For completeness, I post the code used for the debugging:
// get the current process
Process currentProcess = System.Diagnostics.Process.GetCurrentProcess();
// bytes currently buffered in memory by the IndexWriter (not the whole process)
long totalBytesOfIndex = writer.RamSizeInBytes();
// physical memory (working set) used by the whole indexing process
long totalBytesOfMemoryUsed = currentProcess.WorkingSet64;
Finally, I found the bug. It is in the ItalianAnalyzer (the analyzer for the Italian language), which was built using the Luca Gentili contribution (http://snowball.tartarus.org/algorithms/italian/stemmer.html). Inside the ItalianAnalyzer class a file containing the stop words was opened several times and never closed after use. This was the cause of the OOM problem for me.
With this bug fixed, Lucene.Net is lightning fast both at building the index and at searching.
SetRAMBufferSizeMB is just one of the ways to determine when to flush the IndexWriter to disk. It will flush segment data when XXX MB have been written to memory and are ready to be flushed to disk.
There are lots of other objects in Lucene that will also use memory and have nothing to do with the RAM buffer.
Usually, the first thing to try when you hit OOM while indexing is to re-use Document/Field instances. If you multithread the indexing, make sure you only reuse them on the same thread. It has happened to me to run OOM because of that, when the underlying I/O is blazing fast and the .NET garbage collector just can't keep up with all the small objects being created.
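For illustration, here is a rough sketch of that reuse pattern against the 2.9-era Lucene.Net API (the field names, the 256 MB buffer value and the exact spellings such as Field.Index.ANALYZED and Field.SetValue are assumptions to check against your own Lucene.Net build):

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class ReuseIndexer
{
    // Index all items while reusing one Document and its Field instances,
    // instead of allocating fresh ones per document (less GC pressure).
    public static void IndexAll(IEnumerable<KeyValuePair<string, string>> items, string indexPath)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
        var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.SetRAMBufferSizeMB(256);   // flush in-memory segment data past 256 MB

        // Create the Document and its Fields once ...
        var idField = new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
        var bodyField = new Field("body", "", Field.Store.NO, Field.Index.ANALYZED);
        var doc = new Document();
        doc.Add(idField);
        doc.Add(bodyField);

        // ... and only swap their values inside the indexing loop.
        foreach (var item in items)
        {
            idField.SetValue(item.Key);
            bodyField.SetValue(item.Value);
            writer.AddDocument(doc);
        }

        writer.Commit();
        writer.Close();
        dir.Close();
    }
}

The point is simply that the Document and Field objects are allocated once and only their values change per document, so the garbage collector has far fewer short-lived objects to chase during a long indexing run.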