How to solve a race condition in etcd leader election?

While testing a CoreOS cluster with three nodes, after successfully adding and removing a few additional nodes, I ran into the following problem, presumably caused by a race condition during the etcd leader election.
Checking the new leader gives:
$ curl -L http://127.0.0.1:4001/v2/stats/leader
{"errorCode":300,"message":"Raft Internal Error","index":629006}
Journalctl for each machine in the cluster gives:
$ journalctl -r -u etcd
-- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. --
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
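Each node's own view of the election can also be checked via the stats endpoint (assuming the default client port 4001), for example:
$ curl -L http://127.0.0.1:4001/v2/stats/self
which reports the node's name and its current Raft state.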
And listing the machines with fleet fails:
$ fleetctl list-machines
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
Listing the machines in the cluster gives:
$ curl -L http://127.0.0.1:7001/v2/admin/machines
[{"name":"","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"},
{"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"},
{"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"},
{"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"},
{"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"},
{"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}]
but only three machines are currently running. I tried to add more machines to reach quorum, to no avail.
I'm running the following version:
$ etcdctl -v
etcdctl version 0.4.6
for which, as mentioned at https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config, the leader module that allowed forcing a leader has been removed. The ugly part is that, since there is no quorum, I cannot remove the machines that are no longer running from the member list using, for example:
$ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6
which has no effect. Is there any way to save the cluster?

In my understanding, you can save the cluster, but it isn't worth it.
The cluster is not accepting new machines because adding a machine itself requires a quorum, and there is no quorum among the existing members. The same goes for removing machines and deleting keys.
If you can bring back up enough of the machines listed as cluster members and have them rejoin successfully, you will regain quorum and save the cluster.
From what I can see, you have six machines listed as members, so you need at least four of them running for the existing cluster to operate.
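The arithmetic here is just the Raft majority rule; as a quick sketch:
$ members=6
$ echo "quorum = $(( members / 2 + 1 ))"
quorum = 4
With only three of the six listed members running, you are one short of a majority, which is why every write (including membership changes) fails.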

Related

Unable to SSH into VM after hardware configuration change

I followed the recommendation to reduce the size of my VM (CPUs from 4 to 2 and memory from 16 GB to 8 GB). After updating the configuration and restarting the VM, I was no longer able to access it via SSH.
The VM has an external IP.
The troubleshooting diagnostic using gcloud does not show any error or issue in the logs. Everything looks fine regarding the firewall configuration.
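For reference, the gcloud checks I mean are along these lines (instance name and zone are placeholders; --troubleshoot is the SSH connectivity diagnostic):
$ gcloud compute ssh myvm --zone=europe-west1-b --troubleshoot
$ gcloud compute firewall-rules list
$ gcloud compute instances get-serial-port-output myvm --zone=europe-west1-b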
I tried to create a new VM under my project (the same project as the original VM) and I cannot access it with SSH either. If I create a new project and a new VM instance under that new project, then I can SSH into it. --> The problem seems to be related to the project itself.
I tried to access it via the serial port and I am getting these errors:
Mar 8 20:31:11 myvm systemd[1]: Started Google OSConfig Agent.
Mar 8 20:32:11 myvm OSConfigAgent[1173]: 2022-03-08T20:32:11.5643Z OSConfigAgent Critical main.go:100: Error parsing metadata, agent cannot start: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=0&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Mar 8 20:32:11 myvm systemd[1]: google-osconfig-agent.service: Main process exited, code=exited, status=1/FAILURE
Mar 8 20:32:11 myvm systemd[1]: google-osconfig-agent.service: Failed with result 'exit-code'.
Mar 8 20:32:12 myvm systemd[1]: google-osconfig-agent.service: Service hold-off time over, scheduling restart.
Mar 8 20:32:12 myvm systemd[1]: google-osconfig-agent.service: Scheduled restart job, restart counter is at 4.
I am blocked and asking for your support. Any ideas or suggestions?

Cloudstack KVM installation failed

I'm installing CloudStack on Ubuntu 20.04 by following this document.
I installed qemu-kvm and cloudstack-agent successfully, but I'm not able to start libvirtd.service. Checking its status gives the following errors:
● libvirtd.service - Virtualization daemon
Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-03-16 18:00:09 IST; 1min 28s ago
TriggeredBy: ● libvirtd-admin.socket
● libvirtd.socket
● libvirtd-ro.socket
Docs: man:libvirtd(8)
https://libvirt.org
Process: 232313 ExecStart=/usr/sbin/libvirtd $libvirtd_opts (code=exited, status=6)
Main PID: 232313 (code=exited, status=6)
Mar 16 18:00:09 host systemd[1]: libvirtd.service: Scheduled restart job, restart counter is at 5.
Mar 16 18:00:09 host systemd[1]: Stopped Virtualization daemon.
Mar 16 18:00:09 host systemd[1]: libvirtd.service: Start request repeated too quickly.
Mar 16 18:00:09 host systemd[1]: libvirtd.service: Failed with result 'exit-code'.
Mar 16 18:00:09 host systemd[1]: Failed to start Virtualization daemon.
The journalctl -xe log also shows cloudstack-usage.service: Failed with result 'exit-code'.
Can anyone suggest what the issue might be?
Are you trying this on a virtualised VM, a bare-metal host, or a Raspberry Pi? This error means some other service that libvirtd may depend on hasn't started. See if you can run "systemctl daemon-reload", start libvirtd manually with "systemctl start libvirtd", and then try the rest. The cloudstack-usage service can be started once the MySQL server is running. If you have further questions, I encourage you to join the CloudStack users mailing list and ask them there: http://cloudstack.apache.org/mailing-lists.html
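Concretely, something like this (the journalctl call is just to see why the daemon exits with status 6):
$ sudo systemctl daemon-reload
$ sudo systemctl start libvirtd
$ systemctl status libvirtd
$ sudo journalctl -u libvirtd -b --no-pager | tail -n 30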
I got that same error message when following the official install guide, at the point of starting the MySQL server. The problem for me was that the [mysqld] section header was missing in the my.cnf file above the config snippet. The documentation is misleading in that case (it reads as if the section header is only relevant when editing the alternative MySQL config file mentioned later on).
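For illustration, the snippet only worked for me once it sat under an explicit [mysqld] header, roughly like this (the values below are the ones from the guide as I remember them, so double-check them there):
[mysqld]
innodb_rollback_on_timeout=1
innodb_lock_wait_timeout=600
max_connections=350
log-bin=mysql-bin
binlog-format = 'ROW'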

Redis timeout with almost no data in the database, using the .NET client

I received this error:
StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms),
next: GET RetryCount, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 7, in-pipe: 0, out-pipe: 0,
serverEndpoint: redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 18745af38fec,
IOCP: (Busy=0,Free=1000,Min=1,Max=1000),
WORKER: (Busy=6,Free=32761,Min=1,Max=32767), v: 2.1.58.34321
(Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
We can see that there is only a single message in the queue (qs=1) and that there are only 7 bytes waiting to be read (in=7). Redis is used by 2 processes and holds settings for the system and stores logs.
It was a fresh re-install, so no logs had been written yet and the database holds maybe 2-3 KB of data :)
This is the only output from Redis:
1:C 12 Sep 2020 15:20:49.293 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 15:20:49.293 # Redis version=6.0.8, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 15:20:49.293 # Configuration loaded
1:M 12 Sep 2020 15:20:49.296 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 15:20:49.296 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 12 Sep 2020 15:20:49.296 # Server initialized
1:M 12 Sep 2020 15:20:49.296 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 12 Sep 2020 15:20:49.296 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 12 Sep 2020 15:20:49.305 * DB loaded from append only file: 0.000 seconds
1:M 12 Sep 2020 15:20:49.305 * Ready to accept connections
so it looks like nothing went wrong on that side.
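(Side note: the two kernel warnings in that startup log can be addressed with the settings Redis itself suggests, applied on the Docker host rather than inside the container; something like:)
$ sudo sysctl -w vm.overcommit_memory=1
$ sudo sysctl -w net.core.somaxconn=511
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled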
The 2 processes accessing it are in Docker containers, as is Redis, all on a single AWS instance with plenty of RAM and disk available.
This is also a one-time event; it has never happened before with the same config.
I'm not very experienced with Redis; is there anything in the error message that looks suspicious?

RADIUS server failed to start in CentOS 7

In the beginning I successfully configured the RADIUS server with MariaDB and httpd. But then I changed the hostname of the server and rebooted. Now, even though MariaDB and httpd are running, radiusd fails to start. Here is the output from journalctl -xe. Please help me.
Jan 10 12:34:08 cpe.twcny.res.rr.com systemd[1]: Unit radiusd.service entered failed state.
Jan 10 12:34:08 cpe.twcny.res.rr.com systemd[1]: radiusd.service failed.
Jan 10 12:34:08 cpe.twcny.res.rr.com polkitd[963]: Unregistered Authentication Agent for unix-process:2183:15540 (system bus name :1.43, object path /org/
Jan 10 12:40:01 cpe.twcny.res.rr.com systemd[1]: Created slice User Slice of root.
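(The journal excerpt above does not show the actual failure reason; the usual next step is to run FreeRADIUS in the foreground with debugging enabled, assuming the stock CentOS 7 package where the daemon is radiusd:)
$ sudo systemctl stop radiusd
$ sudo radiusd -X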

I changed the disk limit of RabbitMQ and now can't restart it.

I am very new to RabbitMQ. I kept getting a disk-limit-full error, so I thought of changing the disk limit. I executed the following commands in sequence:
#1. rabbitmqctl set_disk_free_limit 1GB
#2. sudo systemctl stop rabbitmq-server.service
#3. sudo systemctl disable rabbitmq-server.service
#4. sudo systemctl start rabbitmq-server.service --> failing
I saw the RabbitMQ process was still running (using the top command), so I executed the following:
#5. sudo service rabbitmq-server stop
#6 sudo service rabbitmq-server start --> failing
I am getting the following error:
Jul 11 15:13:37 sk-backend-vm rabbitmq-server[31810]: Starting broker...
Jul 11 15:13:40 sk-backend-vm rabbitmq-server[31810]: erl_child_setup closed
Jul 11 15:13:40 sk-backend-vm rabbitmq-server[31810]: [1B blob data]
Jul 11 15:13:40 sk-backend-vm rabbitmq-server[31810]: Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Jul 11 15:13:40 sk-backend-vm systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 15:13:40 sk-backend-vm systemd[1]: Stopped RabbitMQ broker.
Jul 11 15:13:40 sk-backend-vm systemd[1]: rabbitmq-server.service: Unit entered failed
Can you please help me here? I am so stuck.
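A sketch of what I would try from here (assuming the stock systemd unit and the default rabbitmq system user; the crash dump should say what actually went wrong):
$ sudo systemctl stop rabbitmq-server
$ ps -u rabbitmq -o pid,cmd              # any leftover beam.smp / epmd processes?
$ sudo pkill -u rabbitmq                 # only if the stop above left strays behind
$ sudo systemctl enable rabbitmq-server  # it was disabled in step #3 above
$ sudo systemctl start rabbitmq-server
$ sudo tail -n 50 /var/log/rabbitmq/erl_crash.dump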