what is copy-on-write memory - redis

As I continuously write data to Redis, the memory used by copy-on-write keeps increasing. Even though I make my program sleep long enough for Redis to finish each background save (the last log message reports 0 MB of memory used by copy-on-write), the next background save goes right back up to a high number.
Example:
1300MB of memory used by cow
1400MB of memory used by cow
0MB of memory used by cow
1500MB of memory used by cow
What exactly do all these numbers mean? As far as I know, if the copy-on-write memory keeps increasing, there is no way there will be enough RAM. Also, during each background save with high copy-on-write usage, Redis seems non-functional: Jedis always hits a socket timeout exception.

Here I will explain a few things: what copy-on-write (CoW) is and how it consumes memory, why setting 'vm.overcommit_memory = 1' won't help with the memory usage and performance issues, and best practices for backing up Redis data.
Copy-on-Write and its memory usage
Redis' snapshot backup leverages the CoW semantics provided by modern operating systems, which solve the problem that forking a process would otherwise copy the parent's memory into the child and double the memory footprint. With CoW, the forked child process shares the parent's original memory space; a memory page is copied only when either process modifies that page.
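To make the mechanism concrete, here is a minimal, hypothetical Python sketch (Linux-only, since it relies on os.fork; it is an illustration, not Redis code). The parent owns a large buffer, forks a child that only reads it, and then keeps writing; the kernel shares the pages and copies one only when the parent dirties it:

import os

# Illustration of copy-on-write (not Redis internals): the parent owns a
# large buffer, forks a child that only reads it, then keeps writing to it.
# Pages stay shared until the parent dirties them, at which point the kernel
# copies just those pages -- the same mechanism BGSAVE relies on.

buf = bytearray(64 * 1024 * 1024)       # 64 MB owned by the parent

pid = os.fork()
if pid == 0:
    # Child (think: the BGSAVE process): reads the buffer, e.g. to serialize it.
    total = sum(buf[::4096])            # read-only access, pages stay shared
    os._exit(0)
else:
    # Parent (think: the Redis server): keeps accepting writes while the child runs.
    for i in range(0, len(buf), 4096):  # touch one byte per 4 KB page
        buf[i] = 1                      # each write copies that page (CoW)
    os.waitpid(pid, 0)

System-wide, physical memory use grows roughly by the number of pages the parent dirties while the child is alive, which is what the "memory used by copy-on-write" figure in the Redis log is reporting.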
While the Redis RDB backup is in progress, data changes keep happening in the parent process, which continues to accept client requests and handle them in memory. If the QPS is high, the parent process will copy a large number of memory pages for those new changes during the child's backup window, and so it consumes extra memory. In the extreme case where every memory page is modified, the memory footprint of the Redis instance doubles. So yes, there is a possibility that the memory doubles, and this fact explains why Redis recommends the 'overcommit_memory = 1' setting, what problem it can resolve, and what it cannot (reducing the memory usage).
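If you prefer not to grep the log, recent Redis versions (4.0 and later, if I recall correctly) expose the same number in INFO persistence as rdb_last_cow_size. A rough monitoring sketch with the redis-py client, assuming a local instance, could look like this:

import time
import redis    # third-party client: pip install redis

# Rough sketch: trigger a background save and report how much extra memory
# it cost in copy-on-write pages. Host/port and polling interval are assumptions.
r = redis.Redis(host="localhost", port=6379)

r.bgsave()                                        # raises an error if a save is already running
while r.info("persistence").get("rdb_bgsave_in_progress"):
    time.sleep(1)

info = r.info("persistence")
cow_mb = info.get("rdb_last_cow_size", 0) / 1024 / 1024
print("last BGSAVE status:", info.get("rdb_last_bgsave_status"))
print("memory used by copy-on-write: %.1f MB" % cow_mb)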
What "vm.overcommit_memory = 1" is, and what issues it resolves
During the RDB backup, you may see an error like this in the log:
10202:M 13 Sep 11:34:16.535 # Can't save in background: fork: Cannot allocate memory
It indicates there is not enough memory to fork the child process for the backup. If the Redis process currently consumes 2 GB of memory, then when forking the child process the operating system assumes you will need ANOTHER 2 GB, so that in the extreme CoW case there is enough memory to copy every dirty memory page. Even though that extra memory is not used at fork time, the kernel checks the available memory up front to avoid out-of-memory errors later. The Redis log suggests the solution:
10202:M 13 Sep 11:33:09.943 # WARNING overcommit_memory is set to 0! Background
save may fail under low memory condition. To fix this issue add
'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the
command 'sysctl vm.overcommit_memory=1' for this to take effect.
So setting 'vm.overcommit_memory = 1' allows the child process to be forked even when free memory is low. If you know that not too many memory pages will be dirtied during the backup, there won't be any actual problem, because the memory will be allocated successfully each time a new CoW copy happens.
Also, 'vm.overcommit_memory = 1' only guarantees that the child process can be forked to back up the Redis data; it cannot reduce the memory usage if write operations keep hitting the parent process the whole time.
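The log above already shows the exact commands; if you additionally want a startup script to verify the setting before launching Redis, a small Linux-only check might look like this (the surrounding deployment logic is an assumption):

from pathlib import Path

# Sketch: verify the kernel overcommit policy before starting Redis.
# 0 = heuristic accounting (fork can fail when free RAM is low),
# 1 = always allow overcommit, which is what Redis recommends.
value = int(Path("/proc/sys/vm/overcommit_memory").read_text().strip())
if value != 1:
    print("vm.overcommit_memory is %d; BGSAVE may fail under low memory." % value)
    print("Fix: run 'sysctl vm.overcommit_memory=1' and persist it in /etc/sysctl.conf")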
Redis backup practice
There are three ways of persisting the Redis in-memory data: RDB (snapshotting), AOF, and a hybrid of the two. Any approach will impact server response time to some extent, no matter how you tune the settings. To minimize the impact of the persistence process, we normally run the backup on a slave instance instead of on the master. However, backing up on a slave introduces a new risk: during a network partition the slave may fall behind the master, so the backup may miss some recent data. One mitigation is to run multiple slaves, which lowers the chance that all of them are out of sync with the master at the same time. Another is a robust monitoring system, so network issues are detected sooner and the partition window stays short.
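As a hedged illustration of that practice, a backup job could refuse to run against the master and only trigger the dump on a replica that is reasonably in sync. With the redis-py client it might look like the sketch below (the host name and the 10-second staleness threshold are placeholders):

import redis    # pip install redis

REPLICA_HOST = "redis-replica.internal"     # placeholder for your replica address
r = redis.Redis(host=REPLICA_HOST, port=6379)

repl = r.info("replication")
if repl.get("role") != "slave":             # Redis still reports replicas as "slave"
    raise SystemExit("refusing to BGSAVE: %s is not a replica" % REPLICA_HOST)

lag = repl.get("master_last_io_seconds_ago", -1)
if lag < 0 or lag > 10:                     # link down, or arbitrary staleness threshold exceeded
    raise SystemExit("replica looks stale (last master IO %ss ago)" % lag)

r.bgsave()                                  # the fork happens on the replica, not the master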

From the Redis FAQ:
Redis background saving schema relies on the copy-on-write semantic of the fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory, the child should use as much memory as the parent, being a copy, but actually, thanks to the copy-on-write semantic implemented by most modern operating systems, the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since, in theory, all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take.
The increased memory usage during the save process depends on the number of writes performed while the dump is in progress, because of the copy-on-write (COW) mechanism.
What you could do instead is configure a Redis slave and delegate the task of persistence to it.

Related

Simulate and optimize a scheduler job to drain data from a data center

I have an assignment to simulate a problem which we currently have: draining data out of old hard drives. Imagine we have 5 hard disks H1 ... H5. Each has a specific capacity Ci and remaining space Ri. We don't want the disks to reach their full capacity, so we need to come up with a scheduler job which frequently drains data out of a disk and relocates it to some other disks. Now the problem is that this draining process impacts the workflow of our system. The performance of the system can be measured by some metrics, let's say M1 and M2. Now, how do I design a draining scheduler which tells me when and how much data should be relocated out of which disk such that it minimizes the impact on M1 and M2?
I use SimPy to simulate this system in python.
For any realistic and practical scenario; the performance metrics (M1 and M2) will have nothing to do with CPU time or (CPU) scheduling whatsoever. All modern (and most "not modern") disk controllers use DMA/bus mastering to transfer data to/from disk themselves (without using any CPU time to do the transfer) so M1 and M2 will (primarily) depend on disk IO bandwidth and not CPU time.
The device driver for the disk controller should/will support some kind of IO priorities; allowing "when disk controller has nothing more important to do (no higher priority transfers), disk controller driver asks disk controller to transfer data to drain the disk (as pre-arranged by file system layer)". In other words "drain disks when disk is idle" can be achieved merely by using a low IO priority.
However; this alone does not work, and the "only drain when the disk is idle" idea is fundamentally flawed. The problem is that if the disk is pounded for a long time it can still become full (because the disk controller continually had higher-priority work to do), leading to a "no free disk space" critical condition (likely failure). The solution is to make the IO priority of draining depend on how full the disk is. If there's enough remaining space on the disk (more than some threshold), then "IO priority of draining" is the lowest priority (so that it doesn't ruin the performance of normal disk IO); as the free space drops below the threshold, the IO priority of draining rises proportionally, until you reach "IO priority of draining is the highest possible priority because there is no free disk space" (sacrificing performance of normal disk IO to prevent a "no free space at all" critical condition as you approach this extreme). Basically, maybe something like "if(Ri < threshold) { draining_IO_priority = (1.0 - Ri / threshold) * (max_IO_priority - min_IO_priority) + min_IO_priority; } else { draining_IO_priority = min_IO_priority; }"
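Since the question mentions simulating this in Python with SimPy, the same mapping could be written as a plain function and plugged into the model (all names and the 0..7 priority scale are placeholders):

def draining_io_priority(remaining, threshold, min_priority, max_priority):
    # Above the threshold the drain runs at the lowest priority; below it the
    # priority rises linearly and reaches max_priority when the disk is full.
    if remaining >= threshold:
        return min_priority
    return (1.0 - remaining / threshold) * (max_priority - min_priority) + min_priority

# Example: threshold of 100 GB free, disk H3 has 20 GB left,
# priorities range from 0 (lowest) to 7 (highest):
print(draining_io_priority(remaining=20, threshold=100, min_priority=0, max_priority=7))   # 5.6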
Also note that the file system layer (and the disk controller driver and almost everything else except some old user-space APIs) is primarily event driven. When the file system receives a request that would cause disk space to be allocated (e.g. resulting from a process doing a "write()"), it responds to that event by deciding whether it needs to send a "drain request" to the disk controller (in addition to allocating some disk space), or whether a previous request needs an IO-priority boost; when the file system receives a "drain request completed" reply event from the disk controller, it decides whether it needs to send another drain request; and so on. With this in mind, the file system layer should use a high CPU-scheduler priority so it can respond to events quickly (but that has nothing to do with disk IO priorities).
Finally; yes there is an "IO scheduler" (e.g. possibly built into the disk controller's driver); but this is hopefully an extremely trivial "when one transfer completes, find the highest priority pending transfer and do that next" algorithm that doesn't require much thought or complexity. However, for some cases it depends on the device (e.g. for old "rotating mechanical disk" hard drives an attempt to reduce/optimize seek times may be involved).
I guess what I'm trying to say is that, for a well designed system, a "draining scheduler" should not exist at all.

Redis-server using all RAM at startup

I'm using Redis and noticed that it crashes with the following error:
MISCONF Redis is configured to save RDB snapshots
I tried the solution suggested in this post
but everything seems to be OK in term of permissions and space.
The htop command tells me that Redis is consuming 70% of RAM. I tried to stop/restart Redis in order to flush it, but at startup the amount of RAM used by Redis grew dramatically and stopped around 66%. I'm pretty sure that at this moment no process was using any Redis instance!
What is happening there?
The growing RAM usage is expected behaviour of Redis at its first data load after a restart, when it reads the data it previously wrote to disk (the snapshot) back into memory. Redis tends to allocate as much memory as it can unless you set the "maxmemory" option in your conf file.
It allocates memory but does not release it immediately. Sometimes that takes hours; I have seen such cases.
A well-known fact about Redis is that it can allocate memory up to twice the size of the dataset it keeps.
I suggest you wait a couple of hours without any restart (Redis can keep working during this time, get/set operations etc.) and keep watching the memory.
Please check this too:
Redis will not always free up (return) memory to the OS when keys are
removed. This is not something special about Redis, but it is how most
malloc() implementations work. For example if you fill an instance
with 5GB worth of data, and then remove the equivalent of 2GB of data,
the Resident Set Size (also known as the RSS, which is the number of
memory pages consumed by the process) will probably still be around
5GB, even if Redis will claim that the user memory is around 3GB. This
happens because the underlying allocator can't easily release the
memory. For example often most of the removed keys were allocated in
the same pages as the other keys that still exist.
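You can watch this effect on your own instance by comparing the memory Redis believes it uses with what the OS actually has resident for the process. A quick sketch with the redis-py client (host/port are assumptions):

import redis    # pip install redis

r = redis.Redis(host="localhost", port=6379)
mem = r.info("memory")

used = mem["used_memory"]        # bytes Redis accounts for its data
rss = mem["used_memory_rss"]     # bytes the OS reports as resident for the process
ratio = mem.get("mem_fragmentation_ratio", rss / used)

print("used_memory:     %6.0f MB" % (used / 1024 ** 2))
print("used_memory_rss: %6.0f MB" % (rss / 1024 ** 2))
print("fragmentation:   %.2f (well above 1.0 means the allocator kept freed pages)" % ratio)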

Unable to save in background (redis-server)

I have two redis servers running on the same machine. The second one's log files have several instances with notices such as these:
[50818] 19 Feb 06:41:05.007 * 10 changes in 300 seconds. Saving...
[50818] 19 Feb 06:41:05.007 # Can't save in background: fork: Cannot allocate memory
In contrast, the log files of the first one contain only successful DB saves. If I were out of memory, I reckon both would have similar logs. It perplexes me that only one has this problem while the other doesn't. Any leads?
Moreover, research led me to this blog post, which contends that the issue can be ameliorated by running sysctl vm.overcommit_memory=1 on the command line. There's no explanation of how this helps. Can someone explain what's going on here in the context of Redis?
As per the Redis FAQ:
Background saving is failing with a fork() error under Linux even if I've a lot of free RAM!
Short answer: echo 1 > /proc/sys/vm/overcommit_memory :)
And now the long one:
Redis background saving schema relies on the copy-on-write semantic of
fork in modern operating systems: Redis forks (creates a child
process) that is an exact copy of the parent. The child process dumps
the DB on disk and finally exits. In theory the child should use as
much memory as the parent being a copy, but actually thanks to the
copy-on-write semantic implemented by most modern operating systems
the parent and child process will share the common memory pages. A
page will be duplicated only when it changes in the child or in the
parent. Since in theory all the pages may change while the child
process is saving, Linux can't tell in advance how much memory the
child will take, so if the overcommit_memory setting is set to zero
fork will fail unless there is as much free RAM as required to really
duplicate all the parent memory pages, with the result that if you
have a Redis dataset of 3 GB and just 2 GB of free memory it will
fail. Setting overcommit_memory to 1 says Linux to relax and perform
the fork in a more optimistic allocation fashion, and this is indeed
what you want for Redis.
A good source to understand how Linux Virtual Memory work and other
alternatives for overcommit_memory and overcommit_ratio is this
classic from Red Hat Magazine, "Understanding Virtual Memory". Beware,
this article had 1 and 2 configuration values for overcommit_memory
reversed: refer to the proc(5) man page for the right meaning of
the available values.
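To turn the "3 GB dataset and just 2 GB of free memory" example into something you can check, you can compare the instance's resident size with the machine's available memory before relying on the default overcommit setting. This is only a ballpark sketch (Linux-only; shared pages and CoW mean the real requirement at fork time is usually smaller):

import redis    # pip install redis

r = redis.Redis(host="localhost", port=6379)
redis_rss = r.info("memory")["used_memory_rss"]

meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":")
        meminfo[key] = int(value.split()[0]) * 1024     # values are reported in kB

available = meminfo.get("MemAvailable", meminfo["MemFree"])
if available < redis_rss:
    print("BGSAVE fork may fail under overcommit_memory=0: "
          "Redis RSS %d MB vs %d MB available" % (redis_rss >> 20, available >> 20))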

Redis snapshot overloading memory

I'm using redis as a client side caching mechanism.
Implemented with C# using stackexchange.redis.
I configured the snapshotting to "save 5 1" and rdbcompression is on.
The RDB mechanism loads the rdb file to memory every time it needs to append data.
The problem is when you have a fairly large RDB file and it's loaded into memory all at once. It chokes up the memory, disk and CPU of the average endpoint.
Is there a way to update the rdb file without loading the whole file to memory?
Also any other solution that lowers the load on the memory and cpu is welcome.
The RDB mechanism loads the rdb file to memory every time it needs to append data.
This isn't what the open source Redis server does (other variants, such as the MSFT fork, may behave differently): RDBs are created by a forked process that copies the contents of memory to disk. The dump file is never loaded, except when it is used for recovery. The increased memory usage during the save process depends on the amount of writes performed while the dump is in progress, because of the copy-on-write (COW) mechanism.
Also any other solution that lowers the load on the memory and cpu is welcome.
There are several ways to tackle this, depending on your requirements and budget. These include:
Using both RDB and AOF for data persistency, thus reducing the frequency of dumps.
Delegating persistency to a slave instance.
Sharding your databases and performing cascading dumps.
We tackled the problem by moving off RDB, and now use AOF exclusively.
We have reduced the memory peaks by lowering auto-aof-rewrite-percentage and also limiting auto-aof-rewrite-min-size to the desired size.
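For reference, those two settings can be changed in redis.conf or at runtime with CONFIG SET; a redis-py sketch with example values (the numbers are illustrative, not recommendations):

import redis    # pip install redis

r = redis.Redis(host="localhost", port=6379)

r.config_set("appendonly", "yes")                             # enable AOF persistence
r.config_set("auto-aof-rewrite-percentage", 50)               # rewrite once the AOF grows 50% past its last rewritten size
r.config_set("auto-aof-rewrite-min-size", 256 * 1024 * 1024)  # but never rewrite AOFs smaller than 256 MB

print(r.config_get("auto-aof-rewrite-*"))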

Redis - Default blocking VM

The blocking VM performance is better overall, as there is no time lost in
synchronization, spawning of threads, and resuming blocked
clients waiting for values. So if you are willing to accept an higher
latency from time to time, blocking VM can be a good pick. Especially
if swapping happens rarely and most of your often accessed data
happens to fit in your memory.
This is the default mode of Redis (and the only mode going forward, I believe, now that VM is deprecated in 2.6), leaving the OS to handle paging (if/when required). Am I correct in my understanding that it will take some time to get "hot" after being booted/started? When working on a 1 GB RAM node with a 16 GB dataset, does Redis attempt to load it all into virtual memory at boot, so that 90%+ is immediately paged out, and only after a good amount of usage does the above statement hold true?
Redis VM was already deprecated in Redis 2.4, and has been removed in Redis 2.6. It is a dead end: don't use it.
I think you are confusing the blocking VM with OS paging. They are two different things.
OS paging is the default mode of Redis when Redis VM is not configured at all (whatever the blocking mode). The OS will swap Redis memory if it does not fit in physical memory. The event loop can be frozen at any time. When it happens, performance is abysmal because none of the Redis internal data structures is designed for this (no locality, no paging system).
Redis VM can be configured in non blocking mode (using I/O threads). When I/Os are done, the event loop is not blocked, and Redis is still responsive. However, when too many I/Os pile up, the I/O threads will be completely busy, and you end up with a responsive Redis, but unable to process any queries requiring I/Os.
Redis VM can also be configured in blocking mode. In this mode all I/Os are synchronously performed in the main event loop thread. So the event loop is frozen in case of I/O (for instance in case of a key miss). All clients are impacted. However, general performance (CPU consumption and latency) is better than with the non blocking mode because some threading scheduling/synchronization is saved.
In practice, the difference between OS paging and the Redis blocking VM is the level of granularity. With Redis VM, the granularity is the key. With OS paging, it is the page (a 4 KB block which can span several unrelated keys).
In all 3 cases, the initial load of the dump file will be extremely slow and generate a peak of random I/Os on your system. As you pointed out, most objects will be loaded and then swapped out. The warm-up time will be significant.
Except if you have extreme locality in your data, or if you do not care at all about the latencies, using 1 GB RAM for a 16 GB dataset with the Redis VM is science-fiction IMO.
There is a reason why the Redis VM was phased out. By design, it will never perform as well as a disk-based datastore (which can exploit file mapping or direct I/Os to avoid the double buffering, and use adapted data structures like B-trees).
Redis as an in-memory store is excellent. But if you need to store something which is bigger than RAM, don't use it. Other (disk-based) stores will all perform much better.