Broken pipe error on query from Aerospike

I have a namespace "test" and a set "demo".
When I run "select * from test.demo" in the aql terminal, I get the error below. What exactly causes a broken pipe?
I also get a warning message in the server log, shown below.
My aerospike.conf is:
service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    proto-fd-max 15000
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode multicast
        multicast-group 239.1.99.222
        port 9918

        # To use unicast-mesh heartbeats, remove the 3 lines above, and see
        # aerospike_mesh.conf for alternative.

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace test {
    replication-factor 2
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine memory
}

namespace bar {
    replication-factor 2
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    # storage-engine device {
    #     file /opt/aerospike/data/bar.dat
    #     filesize 16G
    #     data-in-memory true # Store data in memory in addition to file.
    # }
}
Can somebody figure out the reason?

I think you are getting a socket error when trying to send the scan result to a socket that has already timed out on the client side.
Error: (-10) Socket read error: 11, [::1]:3000, 36006
By default, the aql timeout is set to 1000 ms.
It can be bumped up, e.g. to 100000 ms, using the -T command line option (or using set timeout within the aql interactive mode):
aql -T 100000
-T, --timeout <ms> Set the timeout (ms) for commands. Default: 1000
This option is equivalent to setting TotalTimeout on other clients.
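For example, with the Aerospike Java client, the scan timeout could be raised like this (a minimal sketch; policy field names vary between client versions, and older clients used policy.timeout instead of totalTimeout):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ScanPolicy;

AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
ScanPolicy policy = new ScanPolicy();
policy.totalTimeout = 100000; // ms; the TotalTimeout equivalent of aql's -T
client.scanAll(policy, "test", "demo",
        (key, record) -> System.out.println(record)); // print each scanned record
client.close();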
Setting the timeout higher should help, but doesn't answer why a basic scan would take so long.
Here is an example setting different client timeouts; it shows the client timing out before the scan result is received. In the server logs you would see the TCP send error for the scan:
WARNING (proto): (proto.c:693) send error - fd 32 Broken pipe
Details from aql console:
aql> set timeout 10
TIMEOUT = 10
aql> select * from test.demo
Error: (-10) Socket read error: 11, 127.0.0.1:3000, 58496
aql> select * from test.demo
Error: (-10) Socket read error: 115, 127.0.0.1:3000, 58498
aql> set timeout 100
TIMEOUT = 100
aql> select * from test.demo
Error: (-10) Socket read error: 115, 127.0.0.1:3000, 58492
aql> set timeout 1000
TIMEOUT = 1000
aql> select * from test.demo
+-----+-------+
| foo | bar |
+-----+-------+
| 123 | "abc" |
+-----+-------+
1 row in set (0.341 secs)
It's still a mystery why your aql client would time out returning 1 record, if the default timeout was kept at 1000 ms. Did you by any chance modify the timeout? Or do you have a huge number of records in the test namespace with null sets?
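To rule out the latter, the per-set record counts can be checked with the info commands (a quick check; output fields vary by server version):

asinfo -v "sets/test"        # per-set object counts in the test namespace
asinfo -v "namespace/test"   # look at the objects / master-objects fields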

IPVS (keepalived) doesn't balance UDP connections

I have two load balancers with Debian 8 and three Graylog servers with Debian 9.
Every server in my network sends logs via rsyslog to a virtual server configured on the LB. The connection is UDP.
The problem is that the packets are not balanced (all connections go to the first real server on the list).
In case of failover, the packets are correctly sent to the other real servers.
The only way I found to re-balance the connections is to remove all real servers from the LB and then restart the keepalived service.
I already tried to set:
ipvsadm --set 0 0 1
Timeout (tcp tcpfin udp): 900 120 1
I already set these two variables:
echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn
echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template
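(For reference, the persistent equivalent of those two runtime settings, in standard sysctl form:)

# /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/); apply with: sysctl -p
net.ipv4.vs.expire_nodest_conn = 1
net.ipv4.vs.expire_quiescent_template = 1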
IPVS is configured as follows:
vrrp_instance logserver {
    state MASTER
    interface eth0
    virtual_router_id 195
    priority 200
    advert_int 1
    authentication {
        auth_type keepalived
        auth_pass xxxxxx
    }
    virtual_ipaddress {
        10.20.20.195/22
    }
}

virtual_server 10.20.20.195 0 {
    delay_loop 60
    protocol UDP
    lb_algo wrr
    lb_kind DR
    persistence_timeout 30

    real_server 10.20.20.196 0 {
        weight 100
        MISC_CHECK {
            connect_timeout 3
            misc_path "/etc/keepalived/checkgraylog 10.20.20.196"
        }
    }
    real_server 10.20.20.197 0 {
        weight 100
        MISC_CHECK {
            connect_timeout 3
            misc_path "/etc/keepalived/checkgraylog 10.20.20.197"
        }
    }
    real_server 10.20.20.198 0 {
        weight 100
        MISC_CHECK {
            connect_timeout 3
            misc_path "/etc/keepalived/checkgraylog 10.20.20.198"
        }
    }
}
Is there a way to effectively balance UDP connections with Direct Routing?
Thank you.
virtual_server 10.20.20.195 12333 {
    delay_loop 60
    protocol UDP
    lb_algo wrr
    lb_kind DR
    ops # <<< - Try this. Works for me (Ubuntu 18.04, Keepalived v1.3.9, ipvsadm v1.28)
    real_server 10.20.20.196 12333 {
The ops option works for me only if either:
the virtual server port is explicitly defined, or
fwmark is used in the virtual_server definition.
It does not work for the virtual_server <IP> 0 form; in that case ipvsadm -Ln shows that the persistent option is used as well.
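For the fwmark variant, a sketch of what that could look like (hedged; the mark value and addresses are placeholders for this setup):

# Mark UDP traffic destined for the VIP in the mangle table...
iptables -t mangle -A PREROUTING -d 10.20.20.195/32 -p udp -j MARK --set-mark 1

# ...then balance on the mark instead of an address:port pair.
virtual_server fwmark 1 {
    delay_loop 60
    protocol UDP
    lb_algo wrr
    lb_kind DR
    ops
    real_server 10.20.20.196 0 {
        weight 100
    }
}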

ASP.NET Core SignalR websocket connection limit

I am load testing a SignalR (ASP.NET Core) application hosted on Windows Server 2016 Standard, using Microsoft.AspNetCore.SignalR.Client.
The .NET Core hosting bundle 2.1.1 is installed.
I cannot create more than 3000 (2950-3050) connections.
I already tried the recommendations described here:
How to configure concurrency in .NET Core Web API?
Limiting performance factors of WebSocket in ASP.NET 4.5?
Set limit concurrent connections for websocket on iis 8
Added limits to UseKestrel (this seems to work if I set the values to 100 or 1000):
var host = new WebHostBuilder()
    .UseKestrel(options =>
    {
        options.Limits.MaxConcurrentConnections = 50000;
        options.Limits.MaxConcurrentUpgradedConnections = 50000;
    })
Changed all aspnet.config files by adding this:
<system.web>
<applicationPool maxConcurrentRequestsPerCPU="50000" />
</system.web>
Executed these commands:
cd %windir%\System32\inetsrv
appcmd.exe set config /section:system.webServer/serverRuntime /appConcurrentRequestLimit:50000
Added a performance counter for Web Service\Current Connections - Maximum Connections. Maximum Connections increases to 3300 and stops.
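(That counter can also be watched from the command line, e.g. with typeperf; counter paths vary by locale:)

typeperf "\Web Service(_Total)\Current Connections" -si 5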
There are no exceptions in the server logs, but I suspect there is some restriction in the system.
The IIS server logs contain only this:
GET /messageshub
id=A_3x1sH9kHM1Rc3oPSgP6w
80 - 172.20.192.11 - - 404 0 0 3
The client exception is basically the following:
System.Net.Http.HttpRequestException: Error while copying content to a stream. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
On Windows you may have a dynamic port exhaustion issue.
By default, Windows has 5000 port numbers ready to be assigned to TCP connections, and 1024 of them are reserved for the OS itself, which leaves you with 3976 ports free to be assigned.
In your case the number is 3300 as you mentioned, but it's possible that 3300 of the connections are established and the rest are in TIME_WAIT.
In any case, I recommend running the following:
netstat -an | find /c "ESTABLISHED"
netstat -an | find /c "TIME_WAIT"
netstat -an | find /c "CLOSE_WAIT"
in order to figure out the number of ESTABLISHED, TIME_WAIT and CLOSE_WAIT connections at the time you received the IO exception. If the total is close to 5000, add this to your registry, then reboot and test again:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
MaxUserPort = 65534 (default 5000, max 65534)
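A hedged example of making that change from an elevated prompt (reboot required; on Windows Server 2008 and later, the dynamic range is controlled with netsh instead of MaxUserPort):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort /t REG_DWORD /d 65534 /f

netsh int ipv4 show dynamicport tcp
netsh int ipv4 set dynamicport tcp start=10000 num=55535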

Aerospike heartbeat configuration for single server, error "Unable to find any suitable network device for node ID"

I want to run an Aerospike server in single-server mode.
Now I have this configuration:
service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    console {
        context any info
    }
}

network {
    service {
        address 127.0.0.1
        port 3000
    }

    heartbeat {
        mode multicast
        multicast-group 239.1.99.222
        port 9918

        # To use unicast-mesh heartbeats, remove the 3 lines above, and see
        # aerospike_mesh.conf for alternative.

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}
namespace test {
    replication-factor 1
    memory-size 20M
    default-ttl 1d # 1 day; use 0 to never expire/evict.
    storage-engine memory
}
When I try to start the server, I get this error in the log:
"Unable to find any suitable network device for node ID"
I also don't want the server to be reachable from the internet.
How do I achieve this and fix the issue?
The node ID is generated using the MAC address of an interface on the host:
https://github.com/aerospike/aerospike-server/blob/master/cf/src/socket.c#L2470
If you don't have any of the default interface names that Aerospike is aware of, you may get this error.
To fix this problem, you can specify your interface name:
http://www.aerospike.com/docs/operations/troubleshoot/startup#problem-with-network-interface
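For example, a minimal sketch (the exact parameter depends on your server version, and eth0 is a placeholder for your actual interface name):

service {
    node-id-interface eth0   # derive the node ID from this interface's MAC
}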
To avoid exposing your Aerospike node on the internet, you can bind it only to localhost or to a private interface, or use other network tools/devices such as a firewall or an ACL to keep the server port unreachable. The best way to avoid exposing Aerospike on the internet is to ensure that the server hosting it is not exposed to the internet at all. If that is not doable, restrict access to your Aerospike port to your Aerospike clients' IPs only, using a firewall. You can also use database credentials, available in the Enterprise Edition:
http://www.aerospike.com/docs/guide/security.html

Aerospike - “n_bytes_memory went negative” in memory-only namespace with TTL

We have a namespace configured to store data in memory only, with a default TTL of a couple of minutes. After we start putting data into it and expiration kicks in, we get these messages in the log (a lot of them, for ~30% of expired records):
WARNING (namespace): (namespace.c::762) set_id 1 - n_bytes_memory went negative!
I have a simple client app plus server config that reproduces this: https://github.com/akkomar/aerospike-test (it's based on Docker and is very easy to start).
Any advice on what the reason might be?
Edit:
I checked this on versions 3.6.4, 3.7.0.1 and 3.7.4
Configuration file used for testing (from https://github.com/akkomar/aerospike-test/blob/master/etc/aerospike.conf):
service {
    user root
    group root
    paxos-single-replica-limit 1
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 1024
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
    console {
        context any info
        context namespace detail
    }
}

network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002
        mesh-port 3002
        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}

namespace test_ns {
    replication-factor 2
    memory-size 1G
    default-ttl 10S
    storage-engine memory
}
Edit 2:
It seems this happens only if I update records via a UDF. The simplest one that reproduces it:
local VAL_KEY = "v"

-- Set the value bin, then create or update the record as appropriate.
-- (Note: the ttl_to_set argument is accepted but never applied here.)
function add_data(rec, val_to_add, ttl_to_set)
    if aerospike:exists(rec) then
        rec[VAL_KEY] = val_to_add
        aerospike:update(rec)
    else
        rec[VAL_KEY] = val_to_add
        aerospike:create(rec)
    end
end
When I execute the same operation via the Java API, everything seems to work fine (the example GitHub repo mentioned earlier has been updated with a Java API example).
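For reference, a minimal sketch of the equivalent create-or-update through the Java client (assuming a local node; the set and key names here are made up, and the bin name "v" matches the UDF's VAL_KEY):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
WritePolicy policy = new WritePolicy();
policy.expiration = 10;                          // TTL in seconds, like default-ttl 10S
Key key = new Key("test_ns", "demo_set", "k1");  // hypothetical set and key
client.put(policy, key, new Bin("v", 123));      // put() creates or updates the record
client.close();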
The meaning of the error message is that the memory space we have accounted for the set went negative, which should not be possible.
This has been logged in our internal bug tracking system for resolution in a future release.
It turned out to be a bug in Aerospike.
It's fixed in version 3.7.4.1 (detailed explanation in https://discuss.aerospike.com/t/problem-with-expiring-records-in-memory-only-namespace-n-bytes-memory-went-negative/2560/6).

Monit cannot open connection errors resulting in alerts from M/Monit that server is down

I'm using Monit and M/Monit to monitor my application infrastructure. But every once in a while, M/Monit will show a "No report" error for a server and mark it down. A few seconds later, the issue clears at the server's next check-in with M/Monit.
The monit logs on some of the servers have these events in them:
Oct 14 12:19:11 ip-10-203-51-199 monit[30307]: M/Monit: cannot open a connection to http://example.com:8080/collector -- Connection timed out
Oct 14 12:20:16 ip-10-203-51-199 monit[30307]: M/Monit: cannot open a connection to http://example.com:8080/collector -- Connection timed out
Oct 14 12:22:21 ip-10-203-51-199 monit[30307]: M/Monit: cannot open a connection to http://example.com:8080/collector -- Connection timed out
What config do I need to tune to raise the threshold before M/Monit considers the server actually down?
Here is the config from the server that has the most trouble:
set httpd port 2812 and
    allow xxx:xxx
set mailserver xxx.xxx.xxx port xxx username "xxx" password "xxx" using tlsv1 with timeout 15 seconds
set daemon 30
    with start delay 120
set logfile syslog facility log_daemon
set alert xxx
set mail-format {
    subject: $EVENT $SERVICE on $HOST
    from: monit@$HOST
    message: Monit $ACTION $SERVICE at $DATE on $HOST: $DESCRIPTION.
}
set mmonit http://xxx:xxx@example.com:8080/collector
There doesn't appear to be any problem with the config file.
The intermittent problem you are experiencing is monit failing to open a socket to the collector port and timing out. See the source code for reference (handle_mmonit()):
http://fossies.org/linux/privat/monit-5.6.tar.gz:a/monit-5.6/src/collector.c
Search for the string "M/Monit: cannot open a connection to".
The timeout value appears to be fixed at 5 seconds in the code, but 5 seconds is ample time to open a socket connection on that port.
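If your monit build supports it, you may be able to raise the collector timeout when registering M/Monit in monitrc (a hedged sketch; check the manual for your release, since the 5.6 source referenced above hard-codes 5 seconds):

set mmonit http://xxx:xxx@example.com:8080/collector
    with timeout 30 seconds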
How often does monit post events to mmonit?
I had the same problem:
[MST Apr 5 11:24:11] error : 'apache' failed protocol test [APACHESTATUS] at [phoenix.example.com]:80 [TCP/IP] -- APACHE-STATUS: error -- no scoreboard found
[MST Apr 5 11:24:16] error : Cannot create socket to [10x.xx.xx.x4]:8080 -- Connection timed out
We had another firewall on top of iptables. Opening up port 8080 on both the input and output sides fixed it!