We are running a .NET application on Fargate, deployed via Terraform, where we specify CPU and memory in the aws_ecs_task_definition resource.
The service has just one task, e.g.:
resource "aws_ecs_task_definition" "test" {
....
cpu = 256
memory = 512
....
From the documentation this is required for Fargate.
You can also specify cpu and memory in the container_definitions, but the documentation states that the field is optional, and as we are already setting values at the task level we did not set them here.
We observed that memory kept growing after the tasks started: depending on the application, sometimes quite fast and sometimes over a longer period.
So we started thinking we had a memory leak and set out to profile it using the dotnet-monitor tool as a sidecar.
As part of introducing the sidecar we set cpu and memory values for our .NET application at the container_definitions level.
After we did this, we observed that memory in our applications behaves much better.
From .NET monitor traces we are seeing that when we set memory at the container_definitions level:
Working Set is much smaller
Gen 0/1/2 GC Count is above 1 (GC occurs early)
GC 0/1/2 Size is less
GC Committed Bytes is smaller
So to summarize: when we do not set memory at the container_definitions level, memory continues to grow and no GC occurs until we are almost running out of memory.
When we set memory at the container_definitions level, GC occurs regularly and memory does not spike.
So we have a solution, but we do not understand why this is the case.
We would like to know why this happens.
For my application, the memory used by the Java process is much more than the heap size.
The system where the containers are running starts to have memory problems because the container is taking much more memory than the heap size.
The heap size is set to 128 MB (-Xmx128m -Xms128m) while the container takes up to 1GB of memory. Under normal conditions, it needs 500MB. If the Docker container has a lower limit (e.g. mem_limit=400MB), the process gets killed by the out-of-memory killer of the OS.
Could you explain why the Java process is using much more memory than the heap? How to size correctly the Docker memory limit? Is there a way to reduce the off-heap memory footprint of the Java process?
I gathered some details about the issue using the commands from Native memory tracking in JVM.
From the host system, I get the memory used by the container.
$ docker stats --no-stream 9afcb62a26c8
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
9afcb62a26c8 xx-xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.0acbb46bb6fe3ae1b1c99aff3a6073bb7b7ecf85 0.93% 461MiB / 9.744GiB 4.62% 286MB / 7.92MB 157MB / 2.66GB 57
From inside the container, I get the memory used by the process.
$ ps -p 71 -o pcpu,rss,size,vsize
%CPU RSS SIZE VSZ
11.2 486040 580860 3814600
$ jcmd 71 VM.native_memory
71:
Native Memory Tracking:
Total: reserved=1631932KB, committed=367400KB
- Java Heap (reserved=131072KB, committed=131072KB)
(mmap: reserved=131072KB, committed=131072KB)
- Class (reserved=1120142KB, committed=79830KB)
(classes #15267)
( instance classes #14230, array classes #1037)
(malloc=1934KB #32977)
(mmap: reserved=1118208KB, committed=77896KB)
( Metadata: )
( reserved=69632KB, committed=68272KB)
( used=66725KB)
( free=1547KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=1048576KB, committed=9624KB)
( used=8939KB)
( free=685KB)
( waste=0KB =0.00%)
- Thread (reserved=24786KB, committed=5294KB)
(thread #56)
(stack: reserved=24500KB, committed=5008KB)
(malloc=198KB #293)
(arena=88KB #110)
- Code (reserved=250635KB, committed=45907KB)
(malloc=2947KB #13459)
(mmap: reserved=247688KB, committed=42960KB)
- GC (reserved=48091KB, committed=48091KB)
(malloc=10439KB #18634)
(mmap: reserved=37652KB, committed=37652KB)
- Compiler (reserved=358KB, committed=358KB)
(malloc=249KB #1450)
(arena=109KB #5)
- Internal (reserved=1165KB, committed=1165KB)
(malloc=1125KB #3363)
(mmap: reserved=40KB, committed=40KB)
- Other (reserved=16696KB, committed=16696KB)
(malloc=16696KB #35)
- Symbol (reserved=15277KB, committed=15277KB)
(malloc=13543KB #180850)
(arena=1734KB #1)
- Native Memory Tracking (reserved=4436KB, committed=4436KB)
(malloc=378KB #5359)
(tracking overhead=4058KB)
- Shared class space (reserved=17144KB, committed=17144KB)
(mmap: reserved=17144KB, committed=17144KB)
- Arena Chunk (reserved=1850KB, committed=1850KB)
(malloc=1850KB)
- Logging (reserved=4KB, committed=4KB)
(malloc=4KB #179)
- Arguments (reserved=19KB, committed=19KB)
(malloc=19KB #512)
- Module (reserved=258KB, committed=258KB)
(malloc=258KB #2356)
$ cat /proc/71/smaps | grep Rss | cut -d: -f2 | tr -d " " | cut -f1 -dk | sort -n | awk '{ sum += $1 } END { print sum }'
491080
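To put the two numbers above side by side, here is a minimal Python sketch (the function names are mine and not part of any JVM tooling). It extracts the committed figure from the NMT Total line and subtracts it from the RSS summed from smaps. Committed memory is not necessarily resident, so the result is only a rough indicator of how much RSS NMT does not explain.

```python
import re


def nmt_committed_kb(nmt_text: str) -> int:
    """Return the 'committed' figure (in KB) from the NMT 'Total:' line."""
    match = re.search(r"Total:\s*reserved=(\d+)KB,\s*committed=(\d+)KB", nmt_text)
    if match is None:
        raise ValueError("no NMT 'Total:' line found")
    return int(match.group(2))


def untracked_kb(rss_kb: int, nmt_text: str) -> int:
    """RSS minus NMT committed memory; may be negative if committed pages were never touched."""
    return rss_kb - nmt_committed_kb(nmt_text)


# Values copied from the jcmd and smaps output in this question.
sample = "Total: reserved=1631932KB, committed=367400KB"
gap = untracked_kb(491080, sample)
print(f"{gap} KB of RSS is not covered by NMT committed memory")
```

With the figures from this question the gap is 123680 KB, i.e. roughly 120 MB that has to be explained by something other than what NMT tracks.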
The application is a web server using Jetty/Jersey/CDI bundled inside a fat jar of 36 MB.
The following version of OS and Java are used (inside the container). The Docker image is based on openjdk:11-jre-slim.
$ java -version
openjdk version "11" 2018-09-25
OpenJDK Runtime Environment (build 11+28-Debian-1)
OpenJDK 64-Bit Server VM (build 11+28-Debian-1, mixed mode, sharing)
$ uname -a
Linux service1 4.9.125-linuxkit #1 SMP Fri Sep 7 08:20:28 UTC 2018 x86_64 GNU/Linux
https://gist.github.com/prasanthj/48e7063cac88eb396bc9961fb3149b58
The virtual memory used by a Java process extends far beyond just the Java heap. The JVM includes many subsystems: the garbage collector, class loading, JIT compilers etc., and all of these subsystems require a certain amount of RAM to function.
The JVM is not the only consumer of RAM. Native libraries (including the standard Java Class Library) may also allocate native memory, and this is not even visible to Native Memory Tracking. The Java application itself can also use off-heap memory by means of direct ByteBuffers.
So what takes memory in a Java process?
JVM parts (mostly shown by Native Memory Tracking)
1. Java Heap
The most obvious part. This is where Java objects live. Heap takes up to -Xmx amount of memory.
2. Garbage Collector
GC structures and algorithms require additional memory for heap management. These structures are Mark Bitmap, Mark Stack (for traversing the object graph), Remembered Sets (for recording inter-region references) and others. Some of them are directly tunable, e.g. -XX:MarkStackSizeMax; others depend on the heap layout, e.g. the larger the G1 regions (-XX:G1HeapRegionSize), the smaller the remembered sets.
GC memory overhead varies between GC algorithms. -XX:+UseSerialGC and -XX:+UseShenandoahGC have the smallest overhead. G1 or CMS may easily use around 10% of total heap size.
3. Code Cache
Contains dynamically generated code: JIT-compiled methods, interpreter and run-time stubs. Its size is limited by -XX:ReservedCodeCacheSize (240M by default). Turning off tiered compilation (-XX:-TieredCompilation) reduces the amount of compiled code and thus the Code Cache usage.
4. Compiler
JIT compiler itself also requires memory to do its job. This can be reduced again by switching off Tiered Compilation or by reducing the number of compiler threads: -XX:CICompilerCount.
5. Class loading
Class metadata (method bytecodes, symbols, constant pools, annotations etc.) is stored in off-heap area called Metaspace. The more classes are loaded - the more metaspace is used. Total usage can be limited by -XX:MaxMetaspaceSize (unlimited by default) and -XX:CompressedClassSpaceSize (1G by default).
6. Symbol tables
Two main hashtables of the JVM: the Symbol table contains names, signatures, identifiers etc. and the String table contains references to interned strings. If Native Memory Tracking indicates significant memory usage by a String table, it probably means the application excessively calls String.intern.
7. Threads
Thread stacks are also responsible for taking RAM. The stack size is controlled by -Xss. The default is 1M per thread, but fortunately things are not so bad. The OS allocates memory pages lazily, i.e. on the first use, so the actual memory usage will be much lower (typically 80-200 KB per thread stack). I wrote a script to estimate how much of RSS belongs to Java thread stacks.
There are other JVM parts that allocate native memory, but they do not usually play a big role in total memory consumption.
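As a rough check of the thread stack figures in point 7, the sketch below (my own helper, assuming the NMT Thread section format shown in the question) divides the committed stack memory by the number of threads.

```python
import re


def avg_committed_stack_kb(nmt_thread_section: str) -> float:
    """Average committed stack memory per thread, in KB, from an NMT Thread section."""
    count = re.search(r"\(thread #(\d+)\)", nmt_thread_section)
    stack = re.search(r"stack:\s*reserved=(\d+)KB,\s*committed=(\d+)KB", nmt_thread_section)
    if count is None or stack is None:
        raise ValueError("not an NMT Thread section")
    return int(stack.group(2)) / int(count.group(1))


# Thread section copied from the question's jcmd output.
sample = """- Thread (reserved=24786KB, committed=5294KB)
(thread #56)
(stack: reserved=24500KB, committed=5008KB)"""
print(f"{avg_committed_stack_kb(sample):.1f} KB committed per thread stack")
```

For the 56 threads in the question this gives about 89 KB per stack, at the low end of the 80-200 KB range mentioned above and far below the 1M default reservation.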
Direct buffers
An application may explicitly request off-heap memory by calling ByteBuffer.allocateDirect. The default off-heap limit is equal to -Xmx, but it can be overridden with -XX:MaxDirectMemorySize. Direct ByteBuffers are included in Other section of NMT output (or Internal before JDK 11).
The amount of direct memory in use is visible through JMX (the java.nio BufferPool MBeans), e.g. in JConsole or Java Mission Control.
Besides direct ByteBuffers there can be MappedByteBuffers: files mapped into the virtual memory of a process. NMT does not track them; however, MappedByteBuffers can also take physical memory, and there is no simple way to limit how much they can take. You can only see the actual usage by looking at the process memory map: pmap -x <pid>
Address Kbytes RSS Dirty Mode Mapping
...
00007f2b3e557000 39592 32956 0 r--s- some-file-17405-Index.db
00007f2b40c01000 39600 33092 0 r--s- some-file-17404-Index.db
^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^
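To total up what mapped files contribute to RSS, the pmap output can be post-processed. This is a minimal sketch under two assumptions: the column layout is the one shown above (Address, Kbytes, RSS, Dirty, Mode, Mapping), and the mappings of interest carry the shared flag "s" in the Mode column, as in the sample. The function name is hypothetical.

```python
def mapped_file_rss_kb(pmap_output: str) -> dict:
    """Sum the RSS column (KB) per mapping name for shared mappings in `pmap -x` output."""
    totals = {}
    for line in pmap_output.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # separator, marker or truncated line
        address, _kbytes, rss, _dirty, mode, mapping = fields
        try:
            int(address, 16)  # data rows start with a hex address
            rss_kb = int(rss)
        except ValueError:
            continue  # header or total line
        if "s" in mode:
            name = mapping.strip()
            totals[name] = totals.get(name, 0) + rss_kb
    return totals


sample = """Address Kbytes RSS Dirty Mode Mapping
00007f2b3e557000 39592 32956 0 r--s- some-file-17405-Index.db
00007f2b40c01000 39600 33092 0 r--s- some-file-17404-Index.db"""
per_file = mapped_file_rss_kb(sample)
print(sum(per_file.values()), "KB resident in shared file mappings")
```

For the two index files above the total is 66048 KB, none of which shows up in NMT.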
Native libraries
JNI code loaded by System.loadLibrary can allocate as much off-heap memory as it wants with no control from JVM side. This also concerns standard Java Class Library. In particular, unclosed Java resources may become a source of native memory leak. Typical examples are ZipInputStream or DirectoryStream.
JVMTI agents, in particular the jdwp debugging agent, can also cause excessive memory consumption.
This answer describes how to profile native memory allocations with async-profiler.
Allocator issues
A process typically requests native memory either directly from OS (by mmap system call) or by using malloc - standard libc allocator. In turn, malloc requests big chunks of memory from OS using mmap, and then manages these chunks according to its own allocation algorithm. The problem is - this algorithm can lead to fragmentation and excessive virtual memory usage.
jemalloc, an alternative allocator, often appears smarter than regular libc malloc, so switching to jemalloc may result in a smaller footprint for free.
Conclusion
There is no guaranteed way to estimate full memory usage of a Java process, because there are too many factors to consider.
Total memory = Heap + Code Cache + Metaspace + Symbol tables +
Other JVM structures + Thread stacks +
Direct buffers + Mapped files +
Native Libraries + Malloc overhead + ...
It is possible to shrink or limit certain memory areas (like Code Cache) by JVM flags, but many others are entirely outside JVM control.
One possible approach to setting Docker limits would be to watch the actual memory usage in a "normal" state of the process. There are tools and techniques for investigating issues with Java memory consumption: Native Memory Tracking, pmap, jemalloc, async-profiler.
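The "watch the normal state" approach can be turned into a small calculation: take RSS samples under representative load and add some headroom on top of the peak. The sketch below is only an illustration; the 25% headroom and the sample values are assumptions of mine, not a recommendation from the JVM or Docker documentation.

```python
import math


def suggest_memory_limit_mib(rss_samples_mib, headroom=0.25):
    """Suggest a container limit: observed peak RSS plus a fractional headroom."""
    if not rss_samples_mib:
        raise ValueError("need at least one RSS sample")
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    peak = max(rss_samples_mib)
    return math.ceil(peak * (1 + headroom))


# Illustrative samples in MiB, collected e.g. from repeated `docker stats` runs.
samples = [461, 480, 500]
print(suggest_memory_limit_mib(samples), "MiB")
```

The headroom has to cover whatever the samples did not catch (load spikes, late class loading, JIT warm-up), so it is worth re-measuring after every significant change to the application.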
Update
Here is a recording of my presentation Memory Footprint of a Java Process.
In this video, I discuss what may consume memory in a Java process, how to monitor and restrain the size of certain memory areas, and how to profile native memory leaks in a Java application.
https://developers.redhat.com/blog/2017/04/04/openjdk-and-containers/:
Why is it when I specify -Xmx=1g my JVM uses up more memory than 1gb
of memory?
Specifying -Xmx=1g is telling the JVM to allocate a 1gb heap. It’s not
telling the JVM to limit its entire memory usage to 1gb. There are
card tables, code caches, and all sorts of other off heap data
structures. The parameter you use to specify total memory usage is
-XX:MaxRAM. Be aware that with -XX:MaxRam=500m your heap will be approximately 250mb.
Older JVMs see the host memory size and are not aware of any container memory limitations (JDK 10 and later detect the container limit by default via -XX:+UseContainerSupport). Without a limit the JVM sees, there is no memory pressure, so the GC also doesn't need to release used memory. I hope -XX:MaxRAM will help you to reduce the memory footprint. Eventually, you can tweak the GC configuration (-XX:MinHeapFreeRatio, -XX:MaxHeapFreeRatio, ...).
There are many types of memory metrics. Docker seems to report the RSS memory size, which can differ from the "committed" memory reported by jcmd (older versions of Docker report RSS+cache as memory usage).
Good discussion and links: Difference between Resident Set Size (RSS) and Java total committed memory (NMT) for a JVM running in Docker container
RSS memory can also be consumed by other utilities in the container (shell, process manager, ...). We don't know what else is running in the container or how you start processes in it.
TL;DR
The detailed memory usage is provided by Native Memory Tracking (NMT) (mainly code metadata and garbage collector). In addition to that, the Java compilers and optimizers C1/C2 consume memory that is not reported in the summary.
The memory footprint can be reduced using JVM flags (but there are trade-offs).
The Docker container must be sized through testing with the expected load of the application.
Details for each component
The shared class space can be disabled inside a container since the classes won't be shared by another JVM process. The following flag can be used. It will remove the shared class space (17MB).
-Xshare:off
The serial garbage collector has a minimal memory footprint at the cost of longer pause times during garbage collection (see Aleksey Shipilëv's comparison between GCs in one picture). It can be enabled with the following flag. It can save up to the GC space used (48MB).
-XX:+UseSerialGC
The C2 compiler can be disabled with the following flags to reduce the profiling data used to decide whether or not to optimize a method.
-XX:+TieredCompilation -XX:TieredStopAtLevel=1
The code space is reduced by 20MB. Moreover, the memory outside JVM is reduced by 80MB (difference between NMT space and RSS space). The optimizing compiler C2 needs 100MB.
The C1 and C2 compilers can be disabled with the following flag.
-Xint
The memory outside the JVM is now lower than the total committed space. The code space is reduced by 43MB. Beware, this has a major impact on the performance of the application. Disabling C1 and C2 compiler reduces the memory used by 170 MB.
Using the Graal VM compiler (a replacement for C2) leads to a slightly smaller memory footprint. It increases the code memory space by 20MB and decreases the memory outside the JVM by 60MB.
The article Java Memory Management for JVM provides some relevant information about the different memory spaces.
Oracle provides some details in Native Memory Tracking documentation. More details about compilation level in advanced compilation policy and in disable C2 reduce code cache size by a factor 5. Some details on Why does a JVM report more committed memory than the Linux process resident set size? when both compilers are disabled.
Java needs a lot of memory. The JVM itself needs a lot of memory to run. The heap is the memory available inside the virtual machine, i.e. available to your application. Because the JVM is a big bundle packed with all the goodies possible, it takes a lot of memory just to load.
Starting with Java 9 you have something called Project Jigsaw, which might reduce the memory used when you start a Java app (along with the start time). Project Jigsaw and the new module system were not necessarily created to reduce memory usage, but if it's important you can give it a try.
You can take a look at this example: https://steveperkins.com/using-java-9-modularization-to-ship-zero-dependency-native-apps/. By using the module system it resulted in a CLI application of 21MB (with the JRE embedded). The JRE takes more than 200MB. That should translate to less allocated memory when the application is up (a lot of unused JRE classes will no longer be loaded).
Here is another nice tutorial: https://www.baeldung.com/project-jigsaw-java-modularity
If you don't want to spend time on this you can simply allocate more memory. Sometimes that's the best option.
How to size correctly the Docker memory limit?
Check the application by monitoring it for some time. To restrict the container's memory, try the -m, --memory bytes option of the docker run command (or something equivalent if you are running it otherwise), like
docker run -d --name my-container --memory 500m <image-name>
I can't answer the other questions.
I restarted my Redis server after 120 days.
Before the restart, memory usage was 29.5GB.
After the restart, memory usage was 27.5GB.
So where did the 2GB reduction come from?
Was it memory that Redis had freed but not returned to the OS, as described in this article? https://redis.io/topics/memory-optimization
Redis will not always free up (return) memory to the OS when keys are
removed. This is not something special about Redis, but it is how most
malloc() implementations work. For example if you fill an instance
with 5GB worth of data, and then remove the equivalent of 2GB of data,
the Resident Set Size (also known as the RSS, which is the number of
memory pages consumed by the process) will probably still be around
5GB, even if Redis will claim that the user memory is around 3GB. This
happens because the underlying allocator can't easily release the
memory. For example often most of the removed keys were allocated in
the same pages as the other keys that still exist. The previous point
means that you need to provision memory based on your peak memory
usage. If your workload from time to time requires 10GB, even if most
of the times 5GB could do, you need to provision for 10GB.
However allocators are smart and are able to reuse free chunks of
memory, so after you freed 2GB of your 5GB data set, when you start
adding more keys again, you'll see the RSS (Resident Set Size) to stay
steady and don't grow more, as you add up to 2GB of additional keys.
The allocator is basically trying to reuse the 2GB of memory
previously (logically) freed.
Because of all this, the fragmentation ratio is not reliable when you
had a memory usage that at peak is much larger than the currently used
memory. The fragmentation is calculated as the physical memory
actually used (the RSS value) divided by the amount of memory
currently in use (as the sum of all the allocations performed by
Redis). Because the RSS reflects the peak memory, when the (virtually)
used memory is low since a lot of keys / values were freed, but the
RSS is high, the ratio RSS / mem_used will be very high.
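To see what the quoted passage means in numbers, here is a small Python sketch that recomputes the ratio from the used_memory and used_memory_rss fields of INFO memory, following Redis's own mem_fragmentation_ratio definition (RSS divided by used memory). The helper names and the sample values are mine; the sample mirrors the 5GB / 3GB example from the quote.

```python
def parse_info(info_text: str) -> dict:
    """Parse `redis-cli INFO memory` style output into a dict of raw string values."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines and section headers
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def fragmentation_ratio(info_text: str) -> float:
    """RSS divided by the memory Redis itself accounts for."""
    fields = parse_info(info_text)
    return int(fields["used_memory_rss"]) / int(fields["used_memory"])


# Illustrative values: 5 GiB resident, 3 GiB logically in use after deleting keys.
sample = """# Memory
used_memory:3221225472
used_memory_rss:5368709120"""
print(f"fragmentation ratio: {fragmentation_ratio(sample):.2f}")
```

A ratio of about 1.67 here does not mean the allocator is fragmented; it only reflects that the RSS still sits at the earlier peak. A restart resets the RSS to what the reloaded dataset actually needs.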
Or was it cache memory used by my Redis server that got freed?
Is Redis using a cache? A cache of a cache?
Thanks!
There are two types of memory in Vulkan puzzling me:
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bit indicates that the host cache
management commands vkFlushMappedMemoryRanges and
vkInvalidateMappedMemoryRanges are not needed to flush host writes to
the device or make device writes visible to the host, respectively.
VK_MEMORY_PROPERTY_HOST_CACHED_BIT bit indicates that memory allocated
with this type is cached on the host. Host memory accesses to uncached
memory are slower than to cached memory, however uncached memory is
always host coherent.
From what I understand, a modification of memory of type COHERENT is seen immediately by both the host and the device, while modifications to memory of type CACHED may not be seen immediately by the host and/or the device, i.e. flushing/invalidating the memory is needed.
I have seen some implementations combine both flags, and it is a valid combination according to section 10.2. Device Memory of the documentation. Isn't that a contradiction (cached and coherent)?
Cached/coherent memory effectively means that the GPU can see the CPU's caches. This often happens on architectures where the GPU and the CPU are sitting on the same chip. The GPU is effectively just another core on the CPU's die, with access to the CPU's caches.
But it can happen on other architectures as well. Some standalone GPUs offer cached/coherent memory. Indeed, most of them don't offer cached memory without coherency. From an architectural standpoint, it represents some way for the GPU to access data through at least part of the CPU cache.
The key thing about cached/coherent memory you should remember is this: if there is an alternative memory type for that memory pool, then the alternative is probably faster for the device to access. Also, if alternatives exist, it is entirely possible that the device may not be able to have images or buffers of certain types/formats stored in such memory types. So unless you really need cached memory access from the CPU, or the device offers no alternative, it's best to avoid it.
There are cache schemes that monitor write accesses to RAM over the memory bus to invalidate the Host's cache when the memory is written to.
This allows the best of both worlds, cached coherent accesses but at the cost of a more complex architecture.
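The two flags are independent bits, which is why they can be combined. The sketch below models this in Python using the bit values of VkMemoryPropertyFlagBits from the Vulkan headers; the helper functions are hypothetical and only encode the rule from the specification text quoted in the question: manual flush/invalidate is needed exactly when host-visible memory lacks the coherent bit, regardless of whether it is cached.

```python
# Bit values as defined for VkMemoryPropertyFlagBits in the Vulkan headers.
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4
HOST_CACHED = 0x8


def needs_manual_flush(flags: int) -> bool:
    """True if vkFlushMappedMemoryRanges / vkInvalidateMappedMemoryRanges are required."""
    return bool(flags & HOST_VISIBLE) and not (flags & HOST_COHERENT)


def describe(flags: int) -> str:
    """Short human-readable summary of a host-visible memory type."""
    cached = "cached" if flags & HOST_CACHED else "uncached"
    sync = "manual flush/invalidate" if needs_manual_flush(flags) else "coherent"
    return f"{cached}, {sync}"


print(describe(HOST_VISIBLE | HOST_COHERENT))                # uncached, coherent
print(describe(HOST_VISIBLE | HOST_CACHED))                  # cached, manual flush/invalidate
print(describe(HOST_VISIBLE | HOST_CACHED | HOST_COHERENT))  # cached, coherent
```

Cached controls how fast host reads are; coherent controls who is responsible for visibility. The third combination is the one the question asks about: the hardware keeps the caches in sync, so the application gets cached reads without explicit flushes.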
As I continuously write data to Redis, the memory used by copy-on-write keeps increasing. Even though I make my program sleep long enough for Redis to finish the background save (the last memory message is 0 MB of memory used by copy-on-write), the next background save goes back to a high number.
Example:
1300MB of memory used by cow
1400MB of memory used by cow
0MB of memory used by cow
1500MB of memory used by cow
What exactly does all this mean? As far as I know, if the copy-on-write memory keeps increasing, there will eventually not be enough RAM. Also, during each background save with high memory usage, Redis seems non-functional. Jedis always hits a socket timeout exception.
Here I will explain a few things: what Copy-on-Write (CoW) is and how it consumes the memory, why setting 'vm.overcommit_memory = 1' won't help the memory usage and performance issue, and best practices of backing up Redis data.
Copy-on-Write and its memory usage
Redis' snapshot backup leverages CoW semantics, which modern operating systems provide to avoid the problem that forking a process would otherwise copy the parent's memory to the child and thus double the memory footprint. With CoW, the forked child process shares the original memory space of the parent process. A memory page is copied only when either process modifies it.
While the Redis RDB backup is ongoing, data changes keep happening in the parent process, which is accepting new requests from clients and handling them in memory. If the QPS is high, many memory pages will be copied for the new changes while the child process is doing the backup, so the Redis instance will consume extra memory. In extreme cases, if all of the memory pages are modified, the memory footprint of the Redis instance will be doubled. This possibility explains why Redis recommends the "overcommit_memory=1" setting, what problem it can resolve, and what it cannot (reducing the memory usage).
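The relation between writes and extra memory can be sketched as simple arithmetic: each page touched during the snapshot costs one page copy, and the total can never exceed the size of the dataset. The function below is an illustration only; the 4 KiB page size is an assumption (with transparent huge pages enabled the copy granularity is much larger), and the name is hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def cow_extra_bytes(dataset_bytes: int, dirtied_pages: int, page_size: int = PAGE_SIZE) -> int:
    """Extra memory caused by copy-on-write while a snapshot child is running."""
    total_pages = -(-dataset_bytes // page_size)  # ceiling division
    # A page is copied at most once, however often it is written afterwards.
    return min(dirtied_pages, total_pages) * page_size


two_gib = 2 * 1024**3
print(cow_extra_bytes(two_gib, 100_000))        # a moderate number of pages dirtied
print(cow_extra_bytes(two_gib, 1_000_000_000))  # worst case: every page dirtied
```

The first call yields about 410 MB of extra memory, the second is capped at the full 2 GiB, which is the doubling described above. This also matches the pattern in the question: the copy-on-write figure depends on how many pages are written during each save, not on the previous save.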
What "vm.overcommit_memory = 1" is, and what issues it resolves
During the RDB backup, you may see a log error like this:
10202:M 13 Sep 11:34:16.535 # Can't save in background: fork: Cannot allocate memory
It indicates there is not enough memory to fork the child process to do the backup. If the Redis process currently consumes 2GB of memory, then when forking the child process the operating system will require that ANOTHER 2GB of memory is available, so that in the extreme case of CoW there is sufficient memory to copy all dirty memory pages. Even though the extra memory is not used yet when forking the child process, the OS checks the idle memory to avoid later out-of-memory errors. The Redis log provides the solution:
10202:M 13 Sep 11:33:09.943 # WARNING overcommit_memory is set to 0! Background
save may fail under low memory condition. To fix this issue add
'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the
command 'sysctl vm.overcommit_memory=1' for this to take effect.
So setting 'vm.overcommit_memory = 1' allows you to fork the child process when idle memory is low. If you know that not too many memory pages will be dirtied during the backup process, there won't be any actual problems, because memory will be allocated successfully every time a new CoW operation happens.
However, 'vm.overcommit_memory = 1' only guarantees that you can fork the child process to back up the Redis data; it cannot reduce the memory usage if write operations are happening all the time in the parent process.
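Before changing the setting it helps to know what the current value means. The sketch below interprets the content of /proc/sys/vm/overcommit_memory using the three modes the Linux kernel defines; the function names are mine, and the warning check simply mirrors the log message above, which complains about the value 0.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw_value: str) -> str:
    """Translate the raw sysctl value into the kernel's overcommit mode."""
    mode = int(raw_value.strip())
    if mode not in OVERCOMMIT_MODES:
        raise ValueError(f"unknown overcommit mode: {mode}")
    return OVERCOMMIT_MODES[mode]


def triggers_redis_warning(raw_value: str) -> bool:
    """True for the value that the Redis log message above warns about."""
    return int(raw_value.strip()) == 0


# On a Linux host the raw value can be read with:
#   open("/proc/sys/vm/overcommit_memory").read()
raw = "0\n"
print(describe_overcommit(raw), "| warning expected:", triggers_redis_warning(raw))
```

Note that this only tells you whether the fork is likely to be refused up front; it says nothing about how much memory the snapshot will actually need.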
Redis backup practice
There are three ways of persisting the Redis in-memory data: RDB (snapshotting), AOF, and a hybrid of the two. Any approach will impact the server response time to some extent, no matter how you configure the settings. To minimize the impact of persistence, we normally run the backup on a slave instance instead of the master instance. However, doing it on a slave introduces a new risk: when a network partition happens, the slave may not be able to stay up to date, so backing up on a slave risks losing some data. One mitigation is to have multiple slaves, which lowers the chance of all of them being out of sync with the master instance. Another is to set up a robust monitoring system, so we can detect network issues sooner and reduce the duration of the network partition.
From the Redis FAQ:
Redis background saving schema relies on the copy-on-write semantic of the fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory, the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. In theory, all the pages may change while the child process is saving.
The increased memory usage during the save process is dependent on the number of writes performed while the dump is undergoing because of the copy-on-write (COW) mechanism.
What you could do instead is configure a Redis slave and delegate the task of persistence to it.