neutron-linuxbridge-agent oslo_service.service amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - rabbitmq

The neutron-linuxbridge-agent log in OpenStack Train shows this error:
2022-03-17 14:38:36.727 6 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
2022-03-17 14:38:36.727 6 ERROR oslo_service.service (class_id, method_id), ConnectionError)
2022-03-17 14:38:36.727 6 ERROR oslo_service.service amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'openstack': vhost '/' is down
2022-03-17 14:38:36.727 6 ERROR oslo_service.service
2022-03-17 14:38:36.729 6 INFO neutron.plugins.ml2.drivers.agent._common_agent [-] Stopping Linux bridge agent agent.
docker logs neutron_linuxbridge_agent shows:
++ /usr/bin/update-alternatives --query iptables
update-alternatives: error: no alternatives for iptables
++ . /usr/local/bin/kolla_neutron_extend_start
+ echo 'Running command: '\''neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini'\'''
+ exec neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
Running command: 'neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini'
openstack network agent list shows every agent's State as UP, but Alive as XXX.
What's the problem with my cluster, and how can I fix it? Thanks a lot.
The key service here is RabbitMQ, as the amqp.exceptions.InternalError indicates, and rabbit@node-3.log shows:
2022-03-18 06:50:35.270 [error] <0.21119.0> Error on AMQP connection <0.21119.0> (1.1.1.2:12345 -> 1.1.1.3:55672 - neutron-linuxbridge-agent:7:11111111-1111-1111-1111-111111111111, vhost: 'none', user: 'openstack', state: opening), channel 0:
{handshake_error,opening,
{amqp_error,internal_error,
"access to vhost '/' refused for user 'openstack': vhost '/' is down",
'connection.open'}}

When I check and log in to the RabbitMQ management site (http://1.1.1.3:15672/), I get this error message:
rabbitmq virtual host experienced an error on node and may be inaccessible
I solved it with the following steps (a command sketch follows):
1. Enter the rabbitmq container and remove, or move out, the recovery.dets file in /var/lib/rabbitmq/mnesia/rabbit@node-3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L.
2. Restart the rabbitmq container.
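A minimal sketch of those two steps, assuming the kolla container is simply named rabbitmq (the vhost hash directory will differ per node; the .vhost file inside it names the vhost it belongs to):
docker exec -it rabbitmq bash
cd /var/lib/rabbitmq/mnesia/rabbit@node-3/msg_stores/vhosts
cat */.vhost                                        # map each hash directory to its vhost name
mv 628WB79CIFDYO9LJI6DKMI09L/recovery.dets /tmp/    # move the file out rather than deleting it
exit
docker restart rabbitmq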
The reason:
In RabbitMQ versions starting with 3.7.0, all message data is combined in the msg_stores/vhosts directory and stored in a subdirectory per vhost. Each vhost directory is named with a hash and contains a .vhost file with the vhost name, so a specific vhost's message set can be backed up separately.
In RabbitMQ versions prior to 3.7.0, messages are stored in several directories under the node data directory: queues, msg_store_persistent and msg_store_transient. There is also a recovery.dets file which contains recovery metadata if the node was stopped gracefully.
My whole cluster was rebooted by accident, and it was recovered by this method.

If you want to fix your problem easily, just deploy RabbitMQ again with kolla-ansible:
kolla-ansible -i <INVENTORY> deploy -t rabbitmq -vvvv
In my experience, the easiest and lowest-cost way to fix RabbitMQ or oslo.messaging problems in OpenStack is to redeploy RabbitMQ rather than spend your time debugging it.
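Once RabbitMQ is back, you can sanity-check the vhost and the Neutron agents; a small sketch, again assuming the kolla container name rabbitmq:
docker exec rabbitmq rabbitmqctl list_vhosts    # '/' should be listed and the management UI error gone
openstack network agent list                    # Alive should change from XXX to :-) once the agents reconnect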

Related

Unable to start second rabbitmq node on single Windows host

I am trying to run two RabbitMQ nodes on a single Windows host. My end goal is to run two RabbitMQ services.
Currently, I have the following commands for the second node in rabbitmq-env-conf.bat:
set RABBITMQ_CONFIG_FILE=%APPDATA%\RabbitMQ\rabbitmq.conf
set RABBITMQ_NODENAME=rabbit2@hostname
set RABBITMQ_DIST_PORT=5673
set RABBITMQ_SERVER_START_ARGS="-rabbitmq_management listener [{port,15673}]"
Running .\rabbitmq-server.bat start produces the following error:
. . .
Starting broker...Logger - error: {removed_failing_handler,rabbit_log}
BOOT FAILED
===========
Error during startup: {error,
{rabbitmq_management,
{bad_return,
{{rabbit_mgmt_app,start,[normal,[]]},
{'EXIT',
{{could_not_start_listener,
[{cowboy_opts,[{sendfile,false}]},{port,15672}],
{shutdown,
{failed_to_start_child,ranch_acceptors_sup,
{listen_error,
{acceptor,{0,0,0,0,0,0,0,0},15672},
eaddrinuse}}}},
. . .
From the log:
Application rabbitmq_management exited with reason: {{could_not_start_listener,[{cowboy_opts,[{sendfile,false}]},{port,15672}],{shutdown,{failed_to_start_child,ranch_acceptors_sup,{listen_error,{acceptor,{0,0,0,0,0,0,0,0},15672},eaddrinuse}}}},{gen_server,call,[rabbit_web_dispatch_registry,{add,rabbitmq_management_tcp,[{cowboy_opts,[{sendfile,false}]},{port,15672}],#Fun<rabbit_web_dispatch.0.73002970>,[{'_',[],[{[],[],rabbit_mgmt_wm_static,{priv_file,rabbitmq_management,"www/index.html"}},{[<<"api">>,<<"overview">>],[],rabbit_mgmt_wm_overview,...},...]}],...},...]}}
It looks like I am unable to set up the RabbitMQ management port successfully despite supplying start args.
15672 is the first node's management port number, and I am not sure why this number is being picked up.
Some troubleshooting pointers would be appreciated.
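Not an answer from the original thread, but for illustration: on RabbitMQ 3.7+ the management listener port can also be set in the node's own config file instead of via RABBITMQ_SERVER_START_ARGS. A hedged sketch, with hypothetical file names:
:: rabbitmq-env-conf.bat for the second node
set RABBITMQ_NODENAME=rabbit2@hostname
set RABBITMQ_NODE_PORT=5673
set RABBITMQ_CONFIG_FILE=%APPDATA%\RabbitMQ\rabbitmq2.conf
:: and in rabbitmq2.conf:
:: management.tcp.port = 15673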

ECS logging (awslogs driver) only logging apache server startup logs to cloudwatch, no error.log & no access.log

My ECS logs (awslogs driver) are not working as expected. In CloudWatch I'm only seeing the server startup logs, not the useful logs from Apache (/var/log/apache2/error.log and /var/log/apache2/access.log).
I have a Docker multi-container setup with one container running the Apache server and the other container running PHP-FPM. My container logs in CloudWatch look like this:
Apache-Container:
23:35:39 *** Running /etc/my_init.d/02_init.sh...
23:35:39 Starting Apache
23:35:39 * Starting Apache httpd web server apache2
23:35:39 /usr/sbin/apache2ctl: 87: ulimit: error setting limit (Operation not permitted)
23:35:39 Setting ulimit failed. See README.Debian for more information.
23:35:40 *** Running /etc/rc.local...
23:35:40 *** Booting runit daemon...
23:35:40 *** Runit started as PID 225
23:35:40 Oct 25 22:35:40 apache-container syslog-ng[231]: syslog-ng starting up; version='3.5.6'
2019-10-26 00:17:01
Oct 25 23:17:01 apache-container CRON[947]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
...
07:35:16 tail: '/var/log/syslog' has been replaced; following new file
...
FPM-Container:
...
10:25:23 172.x.x.x - 27/Okt/2019:09:25:23 +0000 "GET /app.php" 200
10:25:25 172.x.x.x - 27/Okt/2019:09:25:24 +0000 "GET /app.php" 200
...
I've checked various forums and online resources. As I understand it, I just need to symlink my logs to STDOUT/STDERR, or even better to /proc/self/fd/1 and /proc/self/fd/2, like this:
ln -sf /dev/stdout /var/log/apache2/access.log
ln -sf /dev/stderr /var/log/apache2/error.log
I've tried to create the links in my Dockerfile via the RUN command and also at runtime, but with no success. I can see that my logs show up correctly in the log files before linking them. I've also tried things like echo "test stderr logs" >> /dev/stderr or echo "test stdout logs" >> /dev/stdout inside and outside the containers, but nothing shows up in the CloudWatch logs. When I try docker logs MY_DOCKER_CONTAINER_ID I get: Error response from daemon: configured logging driver does not support reading.
Maybe I'm missing some basic knowledge here. I see that syslog is in my environment/base image (maybe I need to merge the syslog and Apache logs?) and that the PHP-FPM container is logging 200s, but only for app.php, even though I would like to know the exact path of the accessed URL.
You need to specify in the docker-compose file used by ECS that it should use the CloudWatch logging driver, like so:
version: '2'
services:
  myapp:
    build:
      context: .
    logging:
      driver: awslogs
      options:
        awslogs-group: "/my/log/group"
        awslogs-region: "us-west-2"
        awslogs-stream-prefix: some-prefix
This should cause /dev/stdout and /dev/stderr to appear in CloudWatch. You can find more information on the logging driver options on the Docker page.
Hey, thanks for the responses. If I remember it right, the problem was that all of my output was redirected to syslog, and there was a misconfiguration in my syslog config.
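One more option worth mentioning (an assumption on my part, not what was done here): instead of symlinks, Apache can be pointed directly at the container's output streams in its site config or apache2.conf, which sidesteps syslog entirely:
# Apache directives writing straight to the container's stdout/stderr
ErrorLog /dev/stderr
CustomLog /dev/stdout combined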

Node not starting after creating a new node in rabbitmq

I want to create a cluster of 3 nodes. I have created two nodes with this command:
RABBITMQ_NODE_PORT=5680 RABBITMQ_NODENAME=rabbit1@localhost rabbitmq-server -detached
Now when I try to stop the node in order to join it to the cluster, it gives me an error stating the node is not started at all.
What I have done so far is install RabbitMQ and start it using rabbitmq-server.
rabbit1@localhost.log
Error description:
init:do_boot/3
init:start_em/1
rabbit:start_it/1 line 480
rabbit:broker_start/0 line 356
rabbit:start_apps/2 line 575
app_utils:manage_applications/6 line 126
lists:foldl/3 line 1263
rabbit:'-handle_app_error/1-fun-0-'/3 line 696
throw:{could_not_start,rabbitmq_mqtt,
{rabbitmq_mqtt,
{{shutdown,
{failed_to_start_child,'rabbit_mqtt_listener_sup_:::1883',
{shutdown,
{failed_to_start_child,
{ranch_listener_sup,{acceptor,{0,0,0,0,0,0,0,0},1883}},
{shutdown,
{failed_to_start_child,ranch_acceptors_sup,
{listen_error,
{acceptor,{0,0,0,0,0,0,0,0},1883},
eaddrinuse}}}}}}},
{rabbit_mqtt,start,[normal,[]]}}}}
Log file(s) (may contain more information):
/usr/local/var/log/rabbitmq/rabbit1@localhost.log
/usr/local/var/log/rabbitmq/rabbit1@localhost_upgrade.log
Terminal:
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit1@localhost
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: [rabbit1@localhost]
rabbit1@localhost:
* connected to epmd (port 4369) on localhost
* epmd reports: node 'rabbit1' not running at all
other nodes on localhost: [rabbit]
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-9206-rabbit@localhost'
* effective user's home directory: /Users/yashparekh
* Erlang cookie hash: +/3SPQl4T2w3zA11j1+o4Q==
I expect the stop_app command to work so that I can join the node to the cluster.
Please let me know where I'm going wrong.
Thanks in advance.
{failed_to_start_child,
{ranch_listener_sup,{acceptor,{0,0,0,0,0,0,0,0},1883}},
{shutdown,
{failed_to_start_child,ranch_acceptors_sup,
{listen_error,
{acceptor,{0,0,0,0,0,0,0,0},1883},
eaddrinuse}}}}}}},
This means that port 1883 (the MQTT port) is already in use. You have to set this port dynamically for the extra node as well, not just the AMQP port.
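For example, a hedged sketch of starting the extra node with its management and MQTT listeners moved along with the AMQP port (port numbers chosen arbitrarily; the -rabbitmq_mqtt argument assumes the same start-args pattern you already use for the management plugin):
RABBITMQ_NODE_PORT=5680 \
RABBITMQ_NODENAME=rabbit1@localhost \
RABBITMQ_SERVER_START_ARGS="-rabbitmq_management listener [{port,15680}] -rabbitmq_mqtt tcp_listeners [1884]" \
rabbitmq-server -detached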

Open MQ broker won't start

We're running GlassFish 4.1.1 (Payara) with MQ 5.1.1. It's an HA setup with a load balancer and a cluster.
GlassFish is running OK. The problem is that MQ won't start.
I think that a remote MQ is starting. I can run imqcmd list bkr -b and I get successful results.
However, when I run imqcmd list bkr (or imqcmd list jmx, without -b hostname) I get:
Host Primary Port
-------------------------
localhost 7676
WARNING: [C4003]: Error occurred on connection creation [localhost:7676]. - cause: java.net.SocketException: Connection reset
Error while connecting to the broker on host 'localhost' and port '7676'.
I'd like to get rid of the error and see my network IP instead of localhost.
Also GF server.log gives this:
[2017-04-12T11:54:46.516-0400] [Payara 4.1] [SEVERE] [rardeployment.start_failed] [javax.enterprise.resource.resourceadapter.com.sun.enterprise.connectors] [tid: _ThreadID=42 _ThreadName=admin-listener(2)] [timeMillis: 1492012486516] [levelValue: 1000] [[
RAR6035 : Resource adapter start failed.
javax.resource.spi.ResourceAdapterInternalException: java.security.PrivilegedActionException: javax.resource.spi.ResourceAdapterInternalException: MQJMSRA_RA4001: start:Aborting:Exception starting EMBEDDED broker=Broker failed to start
at com.sun.enterprise.connectors.jms.system.ActiveJmsResourceAdapter.startResourceAdapter(ActiveJmsResourceAdapter.java:557)
at com.sun.enterprise.connectors.ActiveOutboundResourceAdapter.init(ActiveOutboundResourceAdapter.java:130)
...
Caused by: java.lang.RuntimeException: Broker failed to start
at com.sun.messaging.jmq.jmsclient.runtime.impl.BrokerInstanceImpl.start(BrokerInstanceImpl.java:205)
at com.sun.messaging.jms.blc.EmbeddedBrokerRunner.start(EmbeddedBrokerRunner.java:331)
at com.sun.messaging.jms.blc.LifecycleManagedBroker.start(LifecycleManagedBroker.java:457)
... 92 more
Caused by: java.io.IOException: [B3297]: Unable to make directory <mydirectory>/imq/instances/imqbroker/etc
at com.sun.messaging.jmq.jmsserver.Broker.initializePasswdFile(Broker.java:376)
I'm wondering where the directory that it is unable to create is configured.
I've been debugging this for days. I need to know where to configure the IP for the embedded broker. I also need to know where to set up the JMX RMI URL.
Any help would be appreciated. Thanks!
I found the solution to this problem. We had a broken symlink to the Open MQ application directory within the GlassFish application directory. On domain startup, GlassFish could not find MQ and therefore could not start the embedded broker. Once we fixed the symlink, the embedded broker started up on GlassFish domain startup (asadmin start-domain).
I knew the embedded broker was not starting because the "imq" folder was not being created in <domaindir>/.
Check for those broken symlinks!
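If you need to hunt for them, a quick sketch (the GlassFish path is a placeholder for your install directory):
# list broken symlinks under the GlassFish install
find /opt/payara/glassfish -xtype l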

Error running topology in production cluster with Apache Storm 1.0.0, topology does not start

I have a topology that runs well on a local cluster.
But when I try to run it on a production cluster, the following things happen:
The nimbus is up
The storm UI is up
The two workers I use are up
ZooKeeper is up
I run storm with
storm jar myjar.jar MyClass
Nimbus submits the topology
The topology and the workers appear in the Storm UI
BUT:
The topology does not start despite the fact that its status is ACTIVE
The log file of the topology does not appear in the workers.
I have the following log in the worker's supervisor.log:
2016-04-15 13:18:19.831 o.a.s.d.supervisor [WARN] There was a connection problem with nimbus. #error {
:cause jobs-rec-storm-nimbus
:via
[{:type java.lang.RuntimeException
:message org.apache.storm.thrift.transport.TTransportException: java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.security.auth.TBackoffConnect retryNext TBackoffConnect.java 64]}
{:type org.apache.storm.thrift.transport.TTransportException
:message java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.thrift.transport.TSocket open TSocket.java 226]}
{:type java.net.UnknownHostException
:message jobs-rec-storm-nimbus
:at [java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]}]
:trace
[[java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]
[java.net.SocksSocketImpl connect SocksSocketImpl.java 392]
[java.net.Socket connect Socket.java 589]
[org.apache.storm.thrift.transport.TSocket open TSocket.java 221]
[org.apache.storm.thrift.transport.TFramedTransport open TFramedTransport.java 81]
[org.apache.storm.security.auth.SimpleTransportPlugin connect SimpleTransportPlugin.java 103]
[org.apache.storm.security.auth.TBackoffConnect doConnectWithRetry TBackoffConnect.java 53]
[org.apache.storm.security.auth.ThriftClient reconnect ThriftClient.java 99]
[org.apache.storm.security.auth.ThriftClient <init> ThriftClient.java 69]
[org.apache.storm.utils.NimbusClient <init> NimbusClient.java 106]
[org.apache.storm.utils.NimbusClient getConfiguredClientAs NimbusClient.java 78]
[org.apache.storm.utils.NimbusClient getConfiguredClient NimbusClient.java 41]
[org.apache.storm.blobstore.NimbusBlobStore prepare NimbusBlobStore.java 268]
[org.apache.storm.utils.Utils getClientBlobStoreForSupervisor Utils.java 462]
[org.apache.storm.daemon.supervisor$fn__9590 invoke supervisor.clj 942]
[clojure.lang.MultiFn invoke MultiFn.java 243]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351$fn__9369 invoke supervisor.clj 582]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351 invoke supervisor.clj 581]
[org.apache.storm.event$event_manager$fn__8903 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-04-15 13:18:19.831 o.a.s.d.supervisor [INFO] Finished downloading code for storm id jobs-KafkaMigration-topology-3-1460740616
2016-04-15 13:18:19.850 o.a.s.d.supervisor [INFO] Missing topology storm code, so can't launch worker with assignment ...(some more numbers)
So I assume that I have a connection problem with Nimbus, but the properties file on the worker is:
storm.zookeeper.servers:
- "192.168.22.209"
- "192.168.22.216"
- "192.168.22.217"
storm.local.dir: "/app/home/storm"
storm.zookeeper.root: "/storm-prod"
#
nimbus.seeds: ["192.168.120.96"]
And if I ping the Nimbus IP from the workers, it returns OK.
Where is the error, and how can I fix it?
Thanks!
What appears to happen in this context is that the Storm supervisor resolves Nimbus from whatever is configured in the storm.yaml seeds/host the first time, and from then on uses the Nimbus hostname to download the topology artifacts.
If that is correct, DNS is mandatory for a cluster setup. This is far from ideal, especially when using containers in an orchestrated environment like Kubernetes.
The current workaround I'm using is adding
storm.local.hostname: "<local.ip.value>"
to the storm.yaml.
Thanks to @bastien who provided the tip on the Storm user mailing list.
I ran into a similar issue. It turns out my firewall rules were blocking the supervisor ports. Make sure the supervisor and Nimbus are able to talk to each other.
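A quick way to check that, as a sketch assuming the default ports (Nimbus thrift on 6627, supervisor slots on 6700-6703):
# from a supervisor box: is the Nimbus thrift port reachable?
nc -vz <nimbus-host> 6627
# from the Nimbus box: is a supervisor slot port reachable?
nc -vz <supervisor-host> 6700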
I found that the hostnames of the boxes need to match what I was calling them in the /etc/hosts file.
In the hosts file I had
xxx.xxx.xxx.xxx nimbus
but the hostname on the box was different, and Storm was pulling the hostname from the OS.
Changing the hostname on the OS of the Nimbus server resolved my issue.
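In other words, the /etc/hosts entry and the OS hostname have to agree; a hedged sketch on the Nimbus box (the address comes from the storm.yaml above, the name is a placeholder):
# /etc/hosts on every node
192.168.120.96   nimbus
# on systemd systems, make the OS hostname match what the other nodes call this box
sudo hostnamectl set-hostname nimbus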