Aerospike connection errors

We run Aerospike server 3.5.15-1 on Ubuntu 14.04 and periodically get server connection errors from PHP clients ([-1]Unable to connect to server). The PHP client version is 3.4.1. We run PHP 5.3 clients from a separate server node, and the connections are created from php-fpm.
There are no corresponding errors in the server logs, and the server didn't have to be restarted, so the problem seems to be on the client side.
This application creates up to 400 simultaneous connections to Aerospike. We use an r3.xlarge EC2 instance, and the server has plenty of available resources.
We followed the Aerospike tuning documentation and tried updating proto-fd and the recommended OS parameters on the server, but it didn't help:
proto-fd-max 100000
proto-fd-idle-ms 15000
This is how we initialize and use Aerospike:
$opts = array(Aerospike::OPT_CONNECT_TIMEOUT => 1250, Aerospike::OPT_WRITE_TIMEOUT => 5000);
$this->db = new Aerospike($config, false, $opts);
// set key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$aero_value = array("value" => $value);
$status = $this->db->put($aero_key, $aero_value, $ttl, $options);
// get key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$status = $this->db->get($aero_key, $result);
Aerospike server stats before the disconnect:
Aug 27 2015 19:32:50 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (237, 16073516, 16073279) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:00 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (334, 16076516, 16076182) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:10 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 1 ::: dq 0 : fds - proto (288, 16079478, 16079190) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:20 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (131, 16082477, 16082346) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:30 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (348, 16084665, 16084317) : hb (0, 0, 0)

From the log segment, we can see that there are around 300 client connections open on the node at any one time, well under the 100000 limit in proto-fd-max.
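The three numbers after "fds - proto" can be read as (currently open, total opened, total closed). That reading is an assumption here, but it is consistent with the log: in every line above, the first value equals the second minus the third. A minimal Python sketch that pulls the open-connection count out of such lines and compares the peak with proto-fd-max (the function names are illustrative):

```python
import re

# Matches the "fds - proto (open, opened, closed)" triple in Aerospike 3.x info lines.
# Assumption: the fields are (currently open, total opened, total closed).
PROTO_RE = re.compile(r"fds - proto \((\d+), (\d+), (\d+)\)")

def proto_fds(line):
    """Return (open, opened, closed) from one info line, or None if absent."""
    m = PROTO_RE.search(line)
    if m is None:
        return None
    return tuple(int(x) for x in m.groups())

def summarize(lines, proto_fd_max=100000):
    """Peak open client connections and its share of proto-fd-max.

    The lines must contain at least one "fds - proto" entry.
    """
    counts = []
    for line in lines:
        fds = proto_fds(line)
        if fds is None:
            continue
        open_now, opened, closed = fds
        # Sanity check of the assumed meaning: open == opened - closed.
        if open_now != opened - closed:
            raise ValueError("unexpected proto counters: " + line)
        counts.append(open_now)
    peak = max(counts)
    return peak, peak / proto_fd_max

log = [
    "... fds - proto (237, 16073516, 16073279) : hb (0, 0, 0) : fab (16, 16, 0)",
    "... fds - proto (348, 16084665, 16084317) : hb (0, 0, 0)",
]
peak, share = summarize(log)
print(peak, f"{share:.2%}")
```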
If you are using multicast for heartbeats (and I think you are), the heartbeats of 0 are fine.
I expect that you have already looked at this, but are you able to check network connectivity between the client and server at the time of the failure? I know that under normal conditions, the client and the server happily coexist, but at the time of the failure, do you see any basic connectivity problems?
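One way to check basic connectivity at the moment of failure is to probe the server's TCP port from the client host on a schedule and log only the failures with timestamps, so they can be lined up with the PHP errors. A minimal Python sketch, assuming the default Aerospike service port 3000 and a hypothetical host name "aerospike-host"; the 1.25 s timeout mirrors the OPT_CONNECT_TIMEOUT of 1250 ms in the code above:

```python
import socket
import time

def probe(host, port=3000, timeout=1.25):
    """Try one TCP connect; return (ok, seconds_taken, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects within the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refused connections and DNS errors
        return False, time.monotonic() - start, str(exc)

def watch(host, port=3000, interval=1.0, rounds=60):
    """Probe repeatedly, print each failure with a wall-clock timestamp, return the count."""
    failures = 0
    for _ in range(rounds):
        ok, took, err = probe(host, port)
        if not ok:
            failures += 1
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  f"connect failed after {took:.3f}s: {err}")
        time.sleep(interval)
    return failures

# Example (hypothetical host name): watch("aerospike-host", rounds=3600)
```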
Do you happen to have other applications installed on the client machine? Do they have any similar failures, possibly at the time of the Aerospike client problems?
Do you have the client installed on more than one server? Do you maybe only see the connectivity errors on one of the servers?
I know you have already been looking at this, so I apologize if I am covering topics that you have already reviewed.
Thank you for your time,
-DM

Related

Flume not closing all files when adding them successively

Here is my Flume configuration:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://testbucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.idleTimeout = 15
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = false
agent.sources.MySpooler.deserializer.maxLineLength = 110000
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100000000
When I add a file to /flume_to_aws and wait, it is uploaded to Amazon S3 and the file is closed normally:
[root#de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00001.csv .
log:
06 Feb 2023 14:02:11,802 INFO [hdfs-s3hdfs-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438) - Closing s3a://testbucket/test/FilePrefix.1675699321675.tmp
06 Feb 2023 14:02:13,599 INFO [hdfs-s3hdfs-call-runner-4] (org.apache.flume.sink.hdfs.BucketWriter$7.call:681) - Renaming s3a://testbucket/test/FilePrefix.1675699321675.tmp to s3a://testbucket/test/FilePrefix.1675699321675
But when I add several files without waiting, not all of them are uploaded, i.e.:
[root#de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00001.csv .
[root#de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00002.csv .
[root#de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00003.csv .
Log (only one file):
06 Feb 2023 14:02:27,842 INFO [hdfs-s3hdfs-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438) - Closing s3a://testbucket/test/FilePrefix.1675699338165.tmp
06 Feb 2023 14:02:31,411 INFO [hdfs-s3hdfs-call-runner-0] (org.apache.flume.sink.hdfs.BucketWriter$7.call:681) - Renaming s3a://testbucket/test/FilePrefix.1675699338165.tmp to s3a://testbucket/test/FilePrefix.1675699338165
In S3 I see only one file. Why does this happen?
I misunderstood the concept.
Actually, it is working fine. The HDFS sink works by "rolling": it keeps writing incoming events to the current file until a roll condition closes it and a new file is started. Those 3 files were rolled together into one because of these 3 parameters:
agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
Since rollInterval, rollSize and rollCount are all set to 0, there is no time, size or event-count trigger (a value of 0 disables each trigger). So all the events from those files are written into a single file, which is closed and stored in S3 only after no new events arrive for agent.sinks.s3hdfs.hdfs.idleTimeout = 15 seconds.
In my case, I now use agent.sinks.s3hdfs.hdfs.rollSize = 2097152, so the file rolls when it reaches 2 MB (2097152 bytes). The sizes of the three files are:
[root#de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00001.csv
1532 /tmp_flume/globalterrorismdb_0522dist.00001.csv
[root#de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00002.csv
1040 /tmp_flume/globalterrorismdb_0522dist.00002.csv
[root#de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00003.csv
908 /tmp_flume/globalterrorismdb_0522dist.00003.csv
1532 KB + 1040 KB + 908 KB = 3480 KB (about 3.4 MB)
As I set it to roll after 2 MB, it stores 2 files in S3.
As we can see, the sizes of the files in S3 match the sum above:
2 MB + 1.4 MB = 3.4 MB
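The effect of rollSize can be illustrated with a toy model in Python. It treats the three spooled files as one continuous stream of fixed-size events and starts a new output file once the current one has reached rollSize; a value of 0 disables the trigger, so everything lands in one file that is closed only by idleTimeout. This is a simplification, not Flume's code: real events vary in size, so each file can overshoot rollSize by up to one event. Sizes are in KiB, as reported by du -sk, and the function name is illustrative.

```python
def roll_by_size(file_sizes_kib, roll_size_kib, event_kib=1):
    """Return the sizes (KiB) of the files the sink would write.

    Toy model: the input is a stream of fixed-size events; roll_size_kib == 0
    means size-based rolling is disabled (as with hdfs.rollSize = 0).
    """
    total_events = sum(file_sizes_kib) // event_kib
    outputs, current = [], 0
    for _ in range(total_events):
        current += event_kib
        # once the file has reached rollSize, the next event starts a new file
        if roll_size_kib and current >= roll_size_kib:
            outputs.append(current)
            current = 0
    if current:
        # the last, partially filled file is closed by idleTimeout
        outputs.append(current)
    return outputs

sizes = [1532, 1040, 908]          # du -sk of the three CSV files
print(roll_by_size(sizes, 0))      # rollSize = 0
print(roll_by_size(sizes, 2048))   # rollSize = 2097152 bytes = 2048 KiB
```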
I just learned this, so please leave feedback if something is wrong.

Odoo14 service keeps crashing (Dumping stacktrace of limit exceeding threads before reloading)

I am facing the following error:
2022-05-30 15:00:26,943 1940 WARNING ? odoo.service.server: Server memory limit (4934283264) reached.
2022-05-30 15:00:26,954 1940 INFO ? odoo.service.server: Dumping stacktrace of limit exceeding threads before reloading
2022-05-30 15:00:26,997 1940 INFO ? odoo.tools.misc:
# Thread: <_MainThread(MainThread, started 140592199739200)> (db:n/a) (uid:n/a) (url:n/a)
File: "/opt/odoo/odoo14/odoo-bin", line 8, in <module>
odoo.cli.main()
File: "/opt/odoo/odoo14/odoo/cli/command.py", line 61, in main
o.run(args)
File: "/opt/odoo/odoo14/odoo/cli/server.py", line 178, in run
main(args)
File: "/opt/odoo/odoo14/odoo/cli/server.py", line 172, in main
rc = odoo.service.server.start(preload=preload, stop=stop)
File: "/opt/odoo/odoo14/odoo/service/server.py", line 1298, in start
rc = server.run(preload, stop)
File: "/opt/odoo/odoo14/odoo/service/server.py", line 546, in run
dumpstacks(thread_idents=[thread.ident for thread in self.limits_reached_threads])
File: "/opt/odoo/odoo14/odoo/tools/misc.py", line 957, in dumpstacks
for line in extract_stack(stack):
2022-05-30 15:00:27,007 1940 INFO ? odoo.service.server: Initiating server reload
I tried several solutions, such as increasing:
limit_request = 8192
limit_time_cpu = 600
limit_time_real = 1200
max_cron_threads = 1
limit_memory_hard = 536870637100
limit_memory_soft = 483183573400
but I am still facing the same issue as in the error log. When I start the server again, I get the same error within 30 minutes at most, again and again.
Best Regards.
Brother, take a look at this link: Configuration suggestions for Odoo server.
If you have a VPS with 4 CPU cores and 16 GB of RAM, the number of workers should be 9 (CPU cores * 2 + 1). The total limit-memory-soft value will be 640 MB x 9 = 5760 MB, and the total limit-memory-hard value 768 MB x 9 = 6912 MB,
so Odoo will use at most about 6.75 GB (6912 MB) of RAM.
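Your server has 4 vCPUs, so try the following in your config file. Odoo applies limit_memory_soft and limit_memory_hard to each worker, so use the per-worker values in bytes (768 * 1024 * 1024 = 805306368 for hard, 640 * 1024 * 1024 = 671088640 for soft) rather than the x9 totals:
limit_memory_hard = 805306368
limit_memory_soft = 671088640
max_cron_threads = 1
workers = 8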
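The sizing rule above can be sketched in Python. It assumes, as the Odoo command-line help describes, that limit_memory_soft and limit_memory_hard apply to each worker individually, so the per-worker byte values are what go into the config file, while the x9 totals are only the overall RAM budget. The 640 MB / 768 MB figures come from the linked suggestions; the function name is illustrative.

```python
MIB = 1024 * 1024

def odoo_sizing(cpu_cores, soft_mib=640, hard_mib=768):
    """Worker count and memory limits following the 'cores * 2 + 1' rule of thumb."""
    workers = cpu_cores * 2 + 1
    return {
        "workers": workers,
        # per-worker values, in bytes, as expected by the Odoo config file
        "limit_memory_soft": soft_mib * MIB,
        "limit_memory_hard": hard_mib * MIB,
        # overall budget if every worker reaches its limit, in MiB
        "total_soft_mib": soft_mib * workers,
        "total_hard_mib": hard_mib * workers,
    }

print(odoo_sizing(4))
```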

Apache httpd poll() takes 38 ms

I have an Apache (OHS) httpd process (1 out of 8, actually) talking to 2 web entry servers (WES), both on Red Hat. Plotting the response times taken from the respective log files shows a constant delta of around 50 ms between the two. Using strace (strace -o <trace output> -ttT -s 2048 -f -xx -p <Pid>), I found that in 85% of the requests (where encrypted transfer is involved) the httpd process is somehow stuck in poll(), which returns only after just over 38 ms. The remaining 10 ms are mainly due to excessive gettimeofday() and other time-related system calls. The WES, on the other hand, appears to send the data in under 100 µs, but its following recvfrom() fails with EAGAIN ("Resource temporarily unavailable"), and it then poll()s on its side for some 50 ms before recvfrom() finishes (with the confirmation of the data transfer from Apache, I suppose).
WES:
40685 16:54:57.111496 poll([{fd=39, events=POLLOUT|POLLWRNORM}], 1, 300000) = 1 ([{fd=39, revents=POLLOUT|POLLWRNORM}]) <0.000071>
40685 16:54:57.111666 sendto(39, " <encrypted data> ) ", 1053, 0, NULL, 0) = 1053 <0.000067>
40685 16:54:57.112249 recvfrom(39, 0x7ff8380c0243, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000061>
40685 16:54:57.112405 poll([{fd=39, events=POLLIN}], 1, 300000 <unfinished ...>
40685 16:54:57.165084 <... poll resumed> ) = 1 ([{fd=39, revents=POLLIN}]) <0.052659>
40685 16:54:57.165177 recvfrom(39, " ", 1, MSG_PEEK, NULL, NULL) = 1 <0.000114>
40685 16:54:57.165388 ioctl(39, FIONREAD, [205]) = 0 <0.000078>
Apache (OHS):
63195 16:54:57.145220 <... poll resumed> ) = 1 ([{fd=22, revents=POLLIN}]) <0.038408>
63195 16:54:57.145299 read(22, " <encrypted data> ", 8000) = 1053 <0.000018>
63195 16:54:57.145739 clock_gettime(CLOCK_REALTIME, {1435589697, 145769536}) = 0 <0.000018>
63195 16:54:57.145810 gettimeofday({1435589697, 145826}, NULL) = 0 <0.000025>
63195 16:54:57.145879 clock_gettime(CLOCK_REALTIME, {1435589697, 145902010}) = 0 <0.000021>
63195 16:54:57.145960 clock_gettime(CLOCK_REALTIME, {1435589697, 145986422}) = 0 <0.000017>
I have two questions:
1) Is the information in the trace output sufficient to identify the cause of the excessive 38 ms poll() (and if so, what is it)?
2) What can be done to improve tracing if 1) must be answered with no?
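For question 2, one low-effort improvement is to let a script, rather than the eye, pick out the slow calls from strace -ttT output and put them on a common timeline. A minimal Python sketch, assuming the line format shown above (pid, wall-clock time, call, and the <seconds> duration that -T appends). For "<... x resumed>" lines the timestamp is printed when the call returns, so the start is derived as end minus duration. Timestamps from two different hosts are only comparable if their clocks are synchronized. The function names are illustrative.

```python
import re

LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<ts>\d\d:\d\d:\d\d\.\d+)\s+"
    r"(?:<\.\.\. (?P<resumed>\w+) resumed>|(?P<call>\w+)\()"
    r".*<(?P<dur>\d+\.\d+)>\s*$"
)

def to_seconds(ts):
    """'16:54:57.145220' -> seconds since midnight as a float."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def slow_calls(lines, threshold=0.010):
    """Yield (start, end, pid, syscall, duration) for calls slower than threshold.

    Lines without a duration (e.g. '<unfinished ...>') do not match and are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        dur = float(m["dur"])
        if dur < threshold:
            continue
        ts = to_seconds(m["ts"])
        if m["resumed"]:
            start, end, name = ts - dur, ts, m["resumed"]  # timestamp = return time
        else:
            start, end, name = ts, ts + dur, m["call"]     # timestamp = entry time
        yield start, end, int(m["pid"]), name, dur

apache_trace = [
    '63195 16:54:57.145220 <... poll resumed> ) = 1 ([{fd=22, revents=POLLIN}]) <0.038408>',
    '63195 16:54:57.145299 read(22, " <encrypted data> ", 8000) = 1053 <0.000018>',
]
for start, end, pid, name, dur in slow_calls(apache_trace):
    print(f"pid {pid}: {name} blocked {dur * 1000:.1f} ms")
```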

iSpin LTL property evaluation only with activated "assertion violations"?

I am trying to get used to iSpin/Promela. I am using:
Spin Version 6.4.3 -- 16 December 2014,
iSpin Version 1.1.4 -- 27 November 2014,
TclTk Version 8.6/8.6,
Windows 8.1.
Here is an example where I try to use LTL. The verification of the LTL property should produce an error if the two steps in the for loop are non-atomic:
#define ten ((n !=10) && (finished == 2))

int n = 0;
int finished = 0;
active [2] proctype P() {
    //assert(_pid == 0 || _pid == 1);

    int t = 0;
    byte j;
    for (j : 1 .. 5) {
        atomic {
            t = n;
            n = t+1;
        }
    }
    finished = finished+1;
}

ltl alwaysten {[] ! ten }
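Why the property can only fail without atomic can be checked by brute force, independently of Spin. The Python sketch below is a toy re-implementation of the model above, not of Spin: it enumerates every interleaving of the two processes, each doing five "t = n; n = t+1" steps, and collects the possible final values of n. The claim [] !ten is violated exactly when some final value differs from 10. The function names are illustrative.

```python
def final_values(iterations=5, atomic=False):
    """All values n can have once both processes have finished their loops."""
    # state: (n, proc0, proc1) where proc = (iterations_done, phase, t)
    # phase 0: next step is 't = n'; phase 1: next step is 'n = t+1'
    start = (0, (0, 0, 0), (0, 0, 0))
    stack, seen, finals = [start], set(), set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        n, procs = state[0], state[1:]
        moved = False
        for i, (done, phase, t) in enumerate(procs):
            if done == iterations:
                continue  # this process has left its for loop
            moved = True
            if atomic:                # atomic { t = n; n = t+1 } is one step
                new_n, new_p = n + 1, (done + 1, 0, 0)
            elif phase == 0:          # t = n
                new_n, new_p = n, (done, 1, n)
            else:                     # n = t+1
                new_n, new_p = t + 1, (done + 1, 0, 0)
            nxt = list(procs)
            nxt[i] = new_p
            stack.append((new_n, *nxt))
        if not moved:
            finals.add(n)  # both processes finished, i.e. finished == 2
    return finals

def violates_alwaysten(**kw):
    """[] !((n != 10) && (finished == 2)) fails iff some final n != 10."""
    return any(n != 10 for n in final_values(**kw))

print(sorted(final_values()), sorted(final_values(atomic=True)))
```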
In the Verification tab I only want to check the LTL property, so I disable all safety properties and activate "use claim". The claim name is "alwaysten".
But it seems that the LTL property is only evaluated if I activate "assertion violations". Why? A colleague is using iSpin v1.1.0 and he does not need to activate this. What am I doing wrong? I want to check assertions and LTL properties independently...
Here is the trace:
pan: elapsed time 0.002 seconds
To replay the error-trail, goto Simulate/Replay and select "Run"
spin -a 1_2_ConcurrentCounters_8.pml
ltl alwaysten: [] (! (((n!=10)) && ((finished==2))))
C:/cygwin/bin/gcc -DMEMLIM=1024 -O2 -DXUSAFE -w -o pan pan.c
./pan -m10000 -E -a -N alwaysten
Pid: 6980
warning: only one claim defined, -N ignored
(Spin Version 6.4.3 -- 16 December 2014)
+ Partial Order Reduction
Full statespace search for:
never claim + (alwaysten)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by -E flag)
State-vector 36 byte, depth reached 57, errors: 0
475 states, stored
162 states, matched
637 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.024 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
64.000 memory used for hash table (-w24)
0.343 memory used for DFS stack (-m10000)
64.539 total actual memory usage
unreached in proctype P
(0 of 13 states)
unreached in claim alwaysten
_spin_nvr.tmp:8, state 10, "-end-"
(1 of 10 states)
pan: elapsed time 0.001 seconds
No errors found -- did you verify all claims?
This is because your LTL property is translated into a never claim that contains an assert statement. See the following automaton.
So, without checking for assertion violations, no error can be found.
(A possible explanation of different behaviors: previous versions of Spin might translate this differently, perhaps using accept instead of assert.)

Dot Net Cisco Command Line Console Parser [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I'm trying to write a Cisco command-line parser to provide an automated graphical user interface in place of the Cisco console output. I have been able to extract the ping time from ping output using regular expressions and graph it, but I am now stuck on the more detailed output of other commands such as "show interfaces".
Any ideas how I can parse the "show interfaces" output and extract all the useful information I need?
Here is an example of "show interfaces" output:
FastEthernet0/0 is up, line protocol is up
Hardware is MV96340 Ethernet, address is 0018.189d.1df0 (bia 0018.189d.1df0)
Description: IP+ connection
Internet address is 164.128.251.50/24
MTU 1500 bytes, BW 100000 Kbit/sec, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 100Mb/s, 100BaseTX/FX
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:00, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/3718/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 2000 bits/sec, 6 packets/sec
5 minute output rate 3000 bits/sec, 10 packets/sec
152817108 packets input, 1043050554 bytes
Received 77347880 broadcasts (67140888 IP multicasts)
0 runts, 0 giants, 3351 throttles
381823 input errors, 0 CRC, 0 frame, 0 overrun, 381823 ignored
0 watchdog
0 input packets with dribble condition detected
99065802 packets output, 440637782 bytes, 0 underruns
0 output errors, 0 collisions, 2 interface resets
300246 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
FastEthernet0/1 is administratively down, line protocol is down
Hardware is MV96340 Ethernet, address is 0018.189d.1df1 (bia 0018.189d.1df1)
MTU 1500 bytes, BW 100000 Kbit/sec, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Auto-duplex, Auto Speed, 100BaseTX/FX
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog
0 input packets with dribble condition detected
0 packets output, 0 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
Tunnel0 is up, line protocol is up
Hardware is Tunnel
Interface is unnumbered. Using address of FastEthernet0/0 (164.128.251.50)
MTU 17912 bytes, BW 100 Kbit/sec, DLY 50000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation TUNNEL, loopback not set
Keepalive not set
Tunnel source 164.128.251.50 (FastEthernet0/0), destination 164.128.32.1
Tunnel Subblocks:
src-track:
Tunnel0 source tracking subblock associated with FastEthernet0/0
Set of tunnels with source FastEthernet0/0, 1 member (includes iterators), on interface
Tunnel protocol/transport PIM/IPv4
Tunnel TOS/Traffic Class 0xC0, Tunnel TTL 255
Tunnel transport MTU 1472 bytes
Tunnel is transmit only
Tunnel transmit bandwidth 8000 (kbps)
Tunnel receive bandwidth 8000 (kbps)
Last input never, output 28w1d, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/0 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
44 packets output, 2464 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 output buffer failures, 0 output buffers swapped out
Virtual-Access1 is up, line protocol is up
Hardware is Virtual Access interface
Description: Internally created by SSLVPN context TEST
MTU 1406 bytes, BW 100000 Kbit/sec, DLY 100000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation SSL
Internal vaccess
Vaccess status 0x0, loopback not set
Keepalive set (10 sec)
DTR is pulsed for 5 seconds on reset
Last input never, output never, output hang never
Last clearing of "show interface" counters 29w5d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
0 packets output, 0 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 output buffer failures, 0 output buffers swapped out
0 carrier transitions
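' Requires "Imports System.Text.RegularExpressions" at the top of the file.
' (?m)^ anchors the split at the start of a line, so interface names quoted
' inside other lines (e.g. "Using address of FastEthernet0/0") do not split.
Dim Interface_Pattern As String = "(POS[0-9]+/[0-9]+(?:/[0-9]+)?|GigabitEthernet[0-9]+/[0-9]+|FastEthernet[0-9]+/[0-9]+)"
' The capturing group makes Regex.Split keep the interface names in the result.
Dim Interface_Long_Split As String() = Regex.Split(Result_Long, "(?m)^" & Interface_Pattern)
Dim Interfaces_List As New List(Of String)
For i As Integer = 0 To Interface_Long_Split.Length - 1 ' arrays are zero-based
    If Regex.IsMatch(Interface_Long_Split(i), "^" & Interface_Pattern & "$") Then
        Interfaces_List.Add(Interface_Long_Split(i))
    End If
Next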
IMHO you are probably on a hiding to nothing with this approach.
You could try parsing that complex output a line at a time rather than as one big blob.
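Parsing a line at a time could look roughly like this Python sketch: a new record starts at each "<name> is <status>, line protocol is <status>" header, and a handful of per-field regexes fill it in. The field names and the choice of fields are illustrative assumptions; the patterns only cover lines like those in the sample output above, and pager residue such as "--More--" is stripped first. The first match for a field wins, so the "Tunnel transport MTU" line does not overwrite the interface MTU.

```python
import re

HEADER_RE = re.compile(
    r"^(\S+) is (up|down|administratively down), line protocol is (up|down)")
# Illustrative subset of fields: (name, pattern, converter)
FIELD_RES = [
    ("mtu", re.compile(r"MTU (\d+) bytes"), int),
    ("bandwidth_kbit", re.compile(r"BW (\d+) Kbit"), int),
    ("input_rate_bps", re.compile(r"input rate (\d+) bits/sec"), int),
    ("output_rate_bps", re.compile(r"output rate (\d+) bits/sec"), int),
    ("input_errors", re.compile(r"(\d+) input errors"), int),
    ("output_errors", re.compile(r"(\d+) output errors"), int),
    ("address", re.compile(r"Internet address is (\S+)"), str),
]

def parse_show_interfaces(text):
    """Return {interface_name: {field: value}} from 'show interfaces' output."""
    interfaces, current = {}, None
    for raw in text.splitlines():
        line = raw.replace("--More--", "").strip()  # drop pager prompts and indentation
        header = HEADER_RE.match(line)
        if header:
            name, status, protocol = header.groups()
            current = interfaces[name] = {"status": status, "protocol": protocol}
            continue
        if current is None:
            continue  # ignore anything before the first interface header
        for field, pattern, convert in FIELD_RES:
            m = pattern.search(line)
            if m and field not in current:
                current[field] = convert(m.group(1))
    return interfaces

# Example: parse_show_interfaces(output)["FastEthernet0/0"]["input_errors"]
```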