*** buffer overflow detected *** /usr/bin/expect terminated - ssh

I am running a Tcl script that opens ssh sessions to multiple servers and keeps them alive for further operations, but I get the error above after it has connected to 1023 servers.
I have set the soft and hard limits to larger values, but still no luck.
Below are the server's limit details and the options that I have tried.
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31189
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 268435456
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 65536
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
cat /proc/sys/fs/file-max
65536
cat /etc/pam.d/login also contains: session required /lib64/security/pam_limits.so
I have tried multiple approaches, including the one described at https://docs.oracle.com/cd/E19623-01/820-6168/file-descriptor-requirements.html
Please help.

The issue is that you're going over the limit of what the Tcl event notifier can handle, as that is based (on Unix) on the select() system call, which has a limit on the maximum FD number supported. In particular, it is limited by the size of a structure on the stack that is set at compile time to (usually) 1024 entries, and Tcl reserves a few file descriptors for other things (e.g., standard in and out). This is a limit on the maximum value of the FD, not just on the maximum number of open descriptors. Going over the maximum causes a buffer overrun on the stack (in a hard-to-control way) and trips the memory-protection code that produces the error you saw. (If you're using Tcl 8.5 or 8.6, you might want to set ulimit -n 1024 to convert a nasty crash into a much nicer one.)
We have fixed this in 8.7 by switching to other system calls that don't have the limitation (8.7a3 was released last month), and despite being an alpha it should be stable enough for this sort of thing. But if that won't do, you have to split your workload of ssh sessions up between multiple processes; perhaps 512 each will work, or maybe half that if that still hits an API limit. Having two or four manager processes instead of one (you can easily have a master process controlling the others over pipes) isn't a great increase in load on the average modern computer, or even one from a decade or two ago.
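As a rough illustration of that split, here is a minimal, hypothetical sketch of a master process dealing hosts out to a few workers over pipes. The worker.tcl script name and the worker count are assumptions, not part of your setup; the worker would read host names on stdin and open/keep its own ssh sessions there.

    # Hypothetical master: spread hosts across a few worker processes so no
    # single Tcl interpreter ever holds more than a few hundred FDs.
    set hosts      {}     ;# fill with your server names
    set numWorkers 4
    set workers    {}

    # Start the workers; each worker.tcl (assumed) reads host names on stdin
    # and opens/keeps its own ssh sessions.
    for {set i 0} {$i < $numWorkers} {incr i} {
        lappend workers [open |[list tclsh worker.tcl] r+]
    }

    # Deal the hosts out round-robin over the pipes.
    set i 0
    foreach host $hosts {
        set w [lindex $workers [expr {$i % $numWorkers}]]
        puts $w $host
        flush $w
        incr i
    }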
Another approach that might be viable in some cases is to run screen inside those ssh sessions so that you can disconnect from them (without losing what was going on) and so keep the number of simultaneous in-use FDs down. That isn't a universal solution.
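If that approach suits your case, a hypothetical Expect/Tcl fragment might look like the following; the user, host and session name are placeholders, and the point is only that closing the ssh connection frees its local FD while screen keeps the remote shell alive.

    package require Expect

    # Illustrative only: "someuser", $host and "worksession" are placeholders.
    set host some.server.example

    # Start a detached screen session on the remote host; screen -dm exits
    # immediately, so ssh exits too and the local FD is released.
    spawn ssh someuser@$host screen -dmS worksession
    expect eof

    # Later, when that host actually needs attention, reattach over a tty:
    spawn ssh -t someuser@$host screen -dr worksession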

Related

Is there a standard on number of Ignite caches per cluster?

I know that a larger number of caches opens more file descriptors and consumes more resources.
Is there a recommendation on the number of caches per Ignite/GridGain cluster?
Is there a recommendation on the number of caches vs. the number of nodes vs. OS configuration (CPU, RAM)?
We have 45 caches (PARTITIONED), and the system configuration is 4 CPUs and 60 GB RAM on each node. It is a 3-node cluster.
The current data storage size is 2 GB, and the data is expected to grow to 1.5-2 TB in the next year.
We are frequently getting "Too many open files" error.
First of all, there's nothing wrong with increasing the file descriptor limit at the OS level. You can use the ulimit utility for that.
Another option is to leverage cache groups, which make caches share some structures, including files.
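For illustration, a hypothetical Spring XML fragment that puts two caches into the same group could look like this (the cache and group names are made up; the relevant piece is the groupName property of CacheConfiguration):

    <property name="cacheConfiguration">
        <list>
            <bean class="org.apache.ignite.configuration.CacheConfiguration">
                <property name="name" value="orders"/>
                <property name="groupName" value="appCaches"/>
            </bean>
            <bean class="org.apache.ignite.configuration.CacheConfiguration">
                <property name="name" value="customers"/>
                <property name="groupName" value="appCaches"/>
            </bean>
        </list>
    </property>

Caches placed in the same group share partition files on disk, which is what keeps the file descriptor count down.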

Telemetry size for MarkLogic 9

We recently upgraded our 8-host MarkLogic cluster from ML8 to ML9 (9.0-8.2). We enabled telemetry for the cluster and saw (in ErrorLog.txt) that around 1 MB of data is being uploaded to telemetry. Our system /tmp space is 5 GB, whereas the Memory, Disk, and Swap Space Requirements (https://docs.marklogic.com/9.0/guide/installation/intro#id_11335) say:
"System temp space sizing - when using telemetry allow for 20 GB maximum in system temp space, although normal usage will likely be less than 100 MB."
My concern is: will we face performance issues if we don't increase /tmp to 20 GB?

Behaviour of redis client-output-buffer-limit during resynchronization

I'm assuming that during replica resynchronisation (full or partial), the master will attempt to send data to the replica as fast as possible. Wouldn't this mean the replica output buffer on the master would rapidly fill up, since the master can likely write faster than the network can deliver? If I have client-output-buffer-limit set for replicas, wouldn't the master end up closing the connection before the resynchronisation can complete?
Yes, the Redis master will close the connection, and the synchronization will start from the beginning again. But please find some details below.
Do you need to touch this configuration parameter, and what is its purpose/benefit/cost?
There is almost zero chance this will happen with the default configuration on reasonably modern hardware.
"By default normal clients are not limited because they don't receive data
without asking (in a push way), but just after a request, so only asynchronous clients may create a scenario where data is requested faster than it can read." - the chunk from documentation .
Even if that happens, the replication will just start from the beginning again, but it may lead to an infinite loop in which replicas continuously ask for synchronization over and over. Each time this happens, the Redis master needs to fork to produce a whole memory snapshot (perform a BGSAVE) and can use up to 3 times the RAM of the initial snapshot size during synchronization. That causes higher CPU utilization, memory spikes, network utilization (if any), and I/O.
General recommendations to avoid production issues when tweaking this configuration parameter:
Don't decrease this buffer, and before increasing its size, make sure you have enough memory on your box.
Budget the total amount of RAM as the snapshot memory size (doubled for the copy-on-write BGSAVE process), plus the size of any other configured buffers, plus some extra capacity.
Please find more details here
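For reference, these are the stock replica output buffer defaults as they appear in redis.conf (shown here for orientation; Redis 5+ accepts "replica" where older versions use "slave"):

    client-output-buffer-limit normal 0 0 0
    client-output-buffer-limit replica 256mb 64mb 60
    client-output-buffer-limit pubsub 32mb 8mb 60

So a replica connection is dropped only if its buffer exceeds 256 MB outright, or stays above 64 MB for 60 seconds.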

Redis memory usage vs space taken up by back ups

I'm looking at the backed-up Redis RDB files for a web application. There are 4 such files (for 4 different Redis servers working concurrently); the sizes are 13 GB + 1.6 GB + 66 MB + 14 MB = ~15 GB.
However, these same 4 instances seem to be taking up 43.8 GB of memory (according to New Relic). Why such a large discrepancy between how much space Redis data takes up in memory vs. on disk? Could it be a misconfiguration, and can the issue be helped?
I don't think there is any problem.
First of all, the data is stored in a compressed format in the RDB file, so its size is smaller than what it is in memory. How small the RDB file is depends on the type of data, but it can be around 20-80% of the memory used by Redis.
Another reason your reported memory usage could be higher than the actual usage (you can compare the figure from New Relic with the one obtained from the redis-cli info memory command) is memory fragmentation. Whenever Redis needs more memory, it gets the memory allocated from the OS, but it does not easily release it back (when a key expires or is deleted). This is not a big issue, as Redis will ask for more memory only after using the extra memory it already has. You can check the memory fragmentation using the redis-cli info memory command.
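To illustrate, a trimmed redis-cli info memory output looks roughly like the following (the numbers here are made up; the field names are the ones to look at):

    $ redis-cli info memory
    # Memory
    used_memory_human:9.71G
    used_memory_rss_human:10.90G
    mem_fragmentation_ratio:1.12

used_memory is what Redis itself has allocated for data, used_memory_rss is what the OS has actually handed to the process, and a mem_fragmentation_ratio well above 1 points to fragmentation rather than extra data.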

What is the recommended max value for Max Connections Per Child in Apache configuration?

I am trying to reduce memory usage by Apache on the server.
My current MaxConnectionsPerChild is 10k.
According to the following recommendation, MaxConnectionsPerChild should be reduced to 1000:
http://www.lophost.com/tutorials/how-to-reduce-high-memory-usage-by-apache-httpd-on-a-cpanel-server/
What is the recommended maximum value for MaxConnectionsPerChild in the Apache configuration?
The only time when this directive affects anything is when your Apache workers are leaking memory. One way this happens is that memory is allocated (via malloc() or whatever) and never freed. It's the result of design/implementation flaws in Apache or its modules.
This directive is somewhat of a hack, really -- but if there's some module that's loaded into Apache that leaks, say, 8 bytes every request, then after a lot of requests, you'll run out of memory. So the quick fix is to just kill the process every MaxConnectionsPerChild requests and start a new one.
This will only affect your memory usage if, with MaxConnectionsPerChild set to zero, you see memory gradually increase over the span of lots of requests.
The default is 0 (which implies no maximum number of connections per child), so unless you have memory leakage I'm unaware of any need to change this setting; I agree with Hut8.
Sharing here FYI from the Apache 2.4 Performance Tuning page:
Related to process creation is process death induced by the MaxConnectionsPerChild setting. By default this is 0, which means that there is no limit to the number of connections handled per child. If your configuration currently has this set to some very low number, such as 30, you may want to bump this up significantly. If you are running SunOS or an old version of Solaris, limit this to 10000 or so because of memory leaks.
And from the Apache 2.4 docs on MaxConnectionsPerChild:
Setting MaxConnectionsPerChild to a non-zero value limits the amount of memory that process can consume by (accidental) memory leakage.
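If you do decide to cap it, a minimal illustrative snippet (the value and the MPM section are just examples, not a recommendation) would be:

    <IfModule mpm_prefork_module>
        # 0, the default, means children are never recycled; a non-zero value
        # only helps if a module really does leak memory on each request.
        MaxConnectionsPerChild 10000
    </IfModule>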