Migrate memory pages of a running process - process

On a NUMA machine, is it possible to migrate memory pages of a running process to one node?
P.S: I know taskset can change the CPU affinity at runtime, but there's no documentation that says how the already allocated memory pages are affected.
As far as I know, numactl only works when launching a process.

There is such a call in the libnuma library (part of the numactl package, around since 2003): http://linux.die.net/man/3/numa
void numa_tonode_memory(void *start, size_t size, int node);
numa_tonode_memory() puts memory on a specific node.
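A minimal sketch of how it could be used (illustrative only; assumes libnuma is installed and the program is linked with -lnuma, error handling omitted):
#include <numa.h>      /* numa_available(), numa_tonode_memory() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 64 * 1024 * 1024;        /* 64 MiB */
    void *buf = malloc(size);
    memset(buf, 0, size);                  /* touch the pages so they get allocated */

    /* ask the kernel to place/move the pages backing [buf, buf+size) onto node 1 */
    numa_tonode_memory(buf, size, 1);

    free(buf);
    return 0;
}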
Under the hood, numa_tonode_memory() may be implemented with the mbind call and the MPOL_MF_MOVE flag: http://man7.org/linux/man-pages/man2/mbind.2.html
mbind - set memory policy for a memory range
If MPOL_MF_MOVE is specified in flags, then the kernel will attempt
to move all the existing pages in the memory range so that they
follow the policy.
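A sketch of calling mbind directly on an existing range (the address and length must be page-aligned; MPOL_MF_MOVE only moves pages used exclusively by the calling process, MPOL_MF_MOVE_ALL requires CAP_SYS_NICE; link with -lnuma for the numaif.h wrapper):
#include <numaif.h>    /* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <stdio.h>

/* Bind the page-aligned range [addr, addr+len) to 'node' and ask the
 * kernel to migrate any pages already resident on other nodes. */
int bind_range_to_node(void *addr, unsigned long len, int node)
{
    unsigned long nodemask = 1UL << node;  /* bitmask with only 'node' set */

    if (mbind(addr, len, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0) {
        perror("mbind");
        return -1;
    }
    return 0;
}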
https://www.kernel.org/doc/Documentation/vm/page_migration
Page migration allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
a new memory policy via mbind().
Or with move_pages: http://man7.org/linux/man-pages/man2/move_pages.2.html
"move_pages - move individual pages of a process to another node"

Related

ChronicleMap Recovery with multi process application

We are evaluating ChronicleMap and our application runs in cluster mode with anywhere from 5 to 45 nodes. The plan is to have the ChronicleMap persisted in a shared NFS folder so that all the nodes can read/write.
With that said, there is a fair chance that individual nodes could go down for various reasons in the middle of a read/write operation. I have some questions:
If node-1 goes down during a write operation, can another healthy node-2 in the cluster still continue to read/write to the files?
Let's say we implement some logic to detect a server crash and call .recoverPersistedTo() on restart. Will this cause any issues while other healthy nodes in the cluster are reading/writing to the files? The reason I ask is that the documentation says
“You must ensure that no other process is accessing the Chronicle Map
store when calling .recoverPersistedTo()”
I have read that using .recoverPersistedTo() in place of createPersistedTo() is not good practice, but what are the downsides?
First of all, we (Chronicle) don't support putting Chronicle Map files on NFS (as we use memory mapping and NFS is known to cause problems with it). Additionally, trying to use recovery on NFS will cause data corruption as there's no adequate file locking on NFS, and recovery tries to lock the file to prevent simultaneous recovery by multiple processes. In general, open source Chronicle Map is supposed to be used by multiple processes on the same host.
The solution to your problem is the commercial Map Enterprise, which supports map replication between nodes; please contact sales@chronicle.software for details.

How do operating systems isolate processes from each other?

Assuming the CPU is in protected mode:
When a ring-0 kernel sets up a ring-3 userspace process, which CPU-level datastructure does it have to modify to indicate which virtual address space this specific process can access?
Does it just set the Privilege Bit of all other memory segments in the Global Descriptor Table to (Ring) 0?
Each process will have its own set of page tables. On x86 that means a page directory with some page tables. The physical address of the page directory is held in the CR3 register. Every set of page tables has the kernel mapped (with kernel permissions), so when you make a system call the kernel can access its own pages; user processes cannot access those pages. When you do a context switch, you change the address in the CR3 register to point at the page tables of the process that will be executed. Because each process has a different set of page tables, each gets a different view of memory. To make sure that no two processes have access to the same physical memory, you need some kind of physical memory manager, which can be queried for a brand new area of memory that is not yet mapped in any other page table.
So as long as each process struct keeps track of its own page table structure, the only CPU-level data structure you have to modify is the CR3 register.
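A minimal sketch of what that looks like on 32-bit x86 (struct process and page_directory_phys are hypothetical names for illustration, not a real kernel API):
/* Hypothetical kernel-side sketch for 32-bit x86. */
struct process {
    unsigned long page_directory_phys;   /* physical address of this process's page directory */
    /* ... saved registers, kernel stack pointer, etc. ... */
};

static inline void load_cr3(unsigned long pgdir_phys)
{
    /* Writing CR3 switches the active page tables and flushes the
     * non-global TLB entries. The instruction is privileged (ring 0 only). */
    asm volatile("mov %0, %%cr3" : : "r"(pgdir_phys) : "memory");
}

void switch_address_space(struct process *next)
{
    load_cr3(next->page_directory_phys);
    /* ... restore registers and return to user mode ... */
}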
It appears that the Global Descriptor Table (GDT) provides a segmentation mechanism that can be used in conjunction with Paging, but is now considered legacy.
By loading the page directory address into the CR3 control register, the Ring 3 process is restricted to the linear memory defined by the paging mechanism. CR3 can only be changed from Ring 0.
In protected mode, the 2 CPL bits in the CS register indicate which ring/privilege level the CPU is on.
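For example, a purely illustrative way to read the current privilege level from CS on x86:
/* Illustrative only: the CPL is the low two bits of the CS selector. */
static inline unsigned int current_cpl(void)
{
    unsigned int cs;
    asm volatile("mov %%cs, %0" : "=r"(cs));
    return cs & 3;    /* 0 = kernel (ring 0), 3 = user mode (ring 3) */
}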
More here:
https://forum.osdev.org/viewtopic.php?f=1&t=31835
https://wiki.osdev.org/Paging
https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics
https://en.wikipedia.org/wiki/X86_memory_segmentation
https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

Redis stream with 50k parallel consumers - capacity requirements

What are the Redis capacity requirements to support 50k consumers within one consumer group consuming and processing messages in parallel? I'm looking to test an infrastructure for this scenario and need to understand the considerations.
Disclaimer: I worked at a company which used Redis at a somewhat large scale (probably fewer consumers than in your case, but our consumers were very active). I wasn't on the infrastructure team, but I was involved in some DevOps tasks.
I don't think you will find an exact number, so I'll try to share some tips and tricks to help you:
Be sure to read the entire Redis Admin page. There's a lot of useful information there. I'll highlight some of the tips from there:
Assuming you'll set up a Linux host, edit /etc/sysctl.conf and set a high net.core.somaxconn (RabbitMQ suggests 4096). Check the documentation of tcp-backlog config in redis.conf for an explanation about this.
Assuming you'll set up a Linux host, edit /etc/sysctl.conf and set vm.overcommit_memory = 1. Read below for a detailed explanation.
Assuming you'll set up a Linux host, edit /etc/sysctl.conf and set fs.file-max. This is very important for your use case. The open file handles / file descriptors limit is essentially the maximum number of file descriptors (each client represents a file descriptor) the OS can handle. Please check the Redis documentation on this. The RabbitMQ documentation also presents some useful information about it.
If you edit the /etc/sysctl.conf file, run sysctl -p to reload it.
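Putting those three settings together, an /etc/sysctl.conf excerpt could look like this (the values are illustrative; tune them to your workload):
# /etc/sysctl.conf - example values only
net.core.somaxconn = 4096
vm.overcommit_memory = 1
fs.file-max = 100000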
"Make sure to disable Linux kernel feature transparent huge pages, it will affect greatly both memory usage and latency in a negative way. This is accomplished with the following command: echo never > /sys/kernel/mm/transparent_hugepage/enabled." Add this command also to /etc/rc.local to make it permanent over reboot.
In my experience Redis is not very resource-hungry, so I believe you won't have issues with CPU. Memory usage is directly related to how much data you intend to store in it.
If you set up a server with many cores, consider using more than one Redis Server. Redis is (mostly) single-threaded and will not use all your CPU resources if you use a single instance in a multicore environment.
The Redis server also warns about wrong/risky configurations in its log on startup.
Explanation on Overcommit Memory (vm.overcommit_memory)
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis [from the Redis FAQ]
There are three possible settings for vm.overcommit_memory:
0 (zero): Heuristic overcommit, the default. The kernel allows moderate overcommitting of address space but refuses allocations that are obviously unreasonable.
1 (one): Always overcommit. The kernel grants every allocation request regardless of how much RAM and swap are actually available; problems only surface later, when the memory is actually touched. This is the setting the Redis FAQ recommends, so that the fork() used for background saves does not fail on large datasets.
2 (two): Never overcommit. The total committed address space is capped at swap plus vm.overcommit_ratio percent of physical RAM (50 by default). For instance, with 1 GB of RAM, no swap and the default ratio, allocations start failing once roughly 512 MB has been committed.
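To check what a host is currently using (CommitLimit in /proc/meminfo is the cap that would be enforced under mode 2):
cat /proc/sys/vm/overcommit_memory
grep -E 'CommitLimit|Committed_AS' /proc/meminfo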

Couchbase 3.1.0 - Hard out of memory error when performing full backup

We recently migrated to Couchbase 3.1.0. The odd thing is that when performing a full backup of a bucket, the web UI alerts "Hard Out Of Memory Error. Bucket X on node Y is full. All memory allocated to this bucket is used for metadata". The RAM usage numbers in the web UI contradict that - about 75% is used, not 100%. I looked into the logs, but haven't found any similar errors there.
Is that even normal?
This is a known issue in the Couchbase Server 3.x releases.
To understand the problem, we must also first understand Database Change Protocol (DCP), the protocol used to transfer data throughout the system. At a high level the flow-control for DCP is as follows:
The Consumer creates a connection with the Producer and sends an Open Connection message. The Consumer then sends a Control message to indicate per-stream flow control. This message will contain "stream_buffer_size" in the key section and the buffer size the Consumer would like each stream to have in the value section.
The Consumer will then start opening streams so that it can receive data from the server.
The Producer will then continue to send data for the stream that has buffer space available until it reaches the maximum send size.
Steps 1-3 continue until the connection is closed, as the Consumer continues to consume items from the stream.
However, the cbbackup utility does not implement any flow control (data buffer limits), and it will try to stream all vbuckets from all nodes at once, with no cap on the buffer size.
While this does not mean that it will use the same amount of memory as your overall data size (as the streams are being drained slowly by the cbbackup process), it does mean that a large memory overhead is required to be able to store the data streams.
When you are in a heavy DGM (disk greater than memory) scenario, the amount of memory required to store the streams is likely to grow more rapidly than cbbackup can drain them as it is streaming large quantities of data off of disk, leading to very large streams, which take up a lot of memory as previously mentioned.
The slightly misleading message about metadata taking up all of the memory is displayed as there is no memory left for the data, so all of the remaining memory is allocated to the metadata, which when using value eviction cannot be ejected from memory.
The reason that this only affects Couchbase Server versions prior to 4.0 is that in 4.0 a server-side improvement to DCP stream management was made that allows DCP streams to be paused to keep the memory footprint down; this is tracked as MB-12179.
As a result, you should not experience the same issue on Couchbase Server versions 4.x+, regardless of how DGM your bucket is.
Workaround
If you find yourself in a situation where this issue is occurring, then terminating the backup job should release all of the memory consumed by the streams immediately.
Unfortunately if you have already had most of your data evicted from memory as a result of the backup, then you will have to retrieve a large quantity of data off of disk instead of RAM for a small period of time, which is likely to increase your get latencies.
Over time 'hot' data will be brought into memory when requested, so this will only be a problem for a small period of time, however this is still a fairly undesirable situation to be in.
The workaround to avoid this issue completely is to only stream a small number of vbuckets at once when performing the backup, as opposed to all vbuckets which cbbackup does by default.
This can be achieved using cbbackupwrapper which comes bundled with all Couchbase Server releases 3.1.0 and later, details of using cbbackupwrapper can be found in the Couchbase Server documentation.
In particular the parameter to pay attention to is the -n flag, which specifies the number of vbuckets to be backed up in a batch at once.
As the name suggests, cbbackupwrapper is simply a wrapper script on top of cbbackup which partitions the vbuckets up and automatically handles all of the directory creation and backup generation, while still using cbbackup under the hood.
As an example, with a batch size of 50, cbbackupwrapper would backup vbuckets 0-49 first, followed by 50-99, then 100-149 etc.
It is suggested that you test with cbbackupwrapper in a testing environment which mirrors your production environment to find suitable values for -n and -P (which controls how many backup processes run at once); the combination of these two controls the amount of memory pressure caused by the backup as well as the overall speed.
You should not find that lowering the value of -n from its default of 100 decreases the backup speed; in some cases the backup speed may actually increase because there is far less memory pressure on the server.
You may however wish to sensibly adjust the -P parameter if you wish to speed up the backup further.
Below is an example command:
cbbackupwrapper http://[host]:8091 [backup_dir] -u [user_name] -p [password] -n 50
It should be noted that if you use cbbackupwrapper to perform your backup then you must also use cbrestorewrapper to restore the data, as cbrestorewrapper is automatically aware of the directory structures used by cbbackupwrapper.
When you run a full backup, by default the backup tool streams data from all nodes over the network. This is not the best way, because it causes a lot of extra load and increased memory usage, especially if you run cbbackup on one of the Couchbase nodes. I would use the data-copy mode of cbbackup, which copies data directly from the files on disk:
> sudo /opt/couchbase/bin/cbbackup couchstore-files:///opt/couchbase/var/lib/couchbase/data/ /tmp/backup
Of course, change the data path to wherever your Couchbase data is actually stored. (In my example it runs as sudo because only root has read access to /opt/couchbase/blabla..) Do this on every node, then collect all the backup folders and put them somewhere. Note that the backups are very compressible, so you might want to zip them before copying over the network.

The meaning of evict() in infinispan cache

According to the Infinispan docs (http://docs.jboss.org/infinispan/5.0/apidocs/), the evict() API removes the entry only from the cache node it was invoked on; it does not remove it from the other caches in the cluster or from the cache store.
If using "replication" mode, where the data is replicated across the caches, surely the data has to stay consistent, and using the evict() API will make it inconsistent.
How then is the inconsistency resolved?
Thanks
Evict removes the entry only from memory on the node where you call it. It does not make the cache inconsistent, because if you call cache.get() and the entry is not found in memory, it is loaded from the cache store.
As the documentation states, the purpose is to tell the cache that the entry won't be used for some time, so the node can free some memory.