RabbitMQ service start failure due to bad .erlang.cookie location

Erlang crashes on RabbitMQ service start because it is unable to create the .erlang.cookie path (log file attached as an image). I need to figure out why an extra "c:" is being prepended to the path, and where that path is defined. Any suggestions? I've uninstalled and deleted everything relating to RabbitMQ multiple times, including the registry entries and environment variables.

Related

Why is "await Publish<T>" hanging / not completing / not finishing

The following piece of code has been working for some time and it has suddenly stopped returning:
await availableChangedPublishEndpoint
    .Publish<IAvailableStockChanged>(
        AvailableStockCounter.ConvertSkuQtyToAvailableStockChangedEvent(
            newAvailable,
            absMessage.Warehouse)
    );
There is nothing clever in ConvertSkuQtyToAvailableStockChangedEvent - it just maps one simple class to another.
We added logs before and after this code and it is definitely just stopping at this point. Other systems are publishing fine, and other messages are being sent from this application (e.g. logs are actually sent via RabbitMQ). We have redeployed and we have upgraded to the latest MassTransit version. We can see that the messages are being published, possibly multiple times, but this Publish method never returns.
We had a broken RabbitMQ node and a clean service restart on one node fixed it. I appreciate there might be other reasons for this behaviour, but this was our problem.
systemctl restart rabbitmq-server
Looking further into RabbitMQ we saw that some of the empty queues connected to this exchange were not synchronized (see below), and when we tried to synchronize them it wouldn't work.
We also couldn't delete some of these unsynchronized queues.
We believe an unexpected shutdown of one of the nodes had caused this problem - but it left most queues / exchanges completely OK.
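If you want to check for the same condition before restarting, a rough sketch (assuming classic mirrored queues and that rabbitmqctl is available on the node; the queue name is a placeholder) is to list each queue's mirrors and synchronised mirrors, and try a manual sync on anything that looks out of date:
# list queues together with their mirror and synchronised-mirror PIDs
rabbitmqctl list_queues name slave_pids synchronised_slave_pids
# attempt to synchronise a specific mirrored queue
rabbitmqctl sync_queue <queue-name>
As described above, in our case the sync wouldn't complete, which is what led to the clean service restart.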

RabbitMQ_MQTT failing to start

I am trying to enable MQTT in RabbitMQ. The plugin has been enabled successfully, but when I make the changes to the config for rabbitmq_mqtt, the service fails to start. Even after a lot of googling, I cannot find the same issue reported elsewhere.
rabbitmq_mqtt fails to load even when the port is available.
Starting broker...
BOOT FAILED
===========
Error description:
{could_not_start,rabbitmq_mqtt,
{{function_clause,
[{rabbit_networking,tcp_listener_addresses,
[{1993}],
[{file,"src/rabbit_networking.erl"},{line,176}]},
{rabbit_mqtt_sup,'-listener_specs/3-lc$^0/1-0-',3,
[{file,"src/rabbit_mqtt_sup.erl"},{line,55}]},
{rabbit_mqtt_sup,init,1,
[{file,"src/rabbit_mqtt_sup.erl"},{line,47}]},
{supervisor2,init,1,[{file,"src/supervisor2.erl"},{line,305}]},
{gen_server,init_it,2,[{file,"gen_server.erl"},{line,365}]},
{gen_server,init_it,6,[{file,"gen_server.erl"},{line,333}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},
{rabbit_mqtt,start,[normal,[]]}}}
Log files (may contain more information):
/var/log/rabbitmq/rabbit.log
/var/log/rabbitmq/rabbit-sasl.log
{"init terminating in do_boot",{could_not_start,rabbitmq_mqtt,{{function_clause,[{rabbit_networking,tcp_listener_addresses,[{1993}],[{file,"src/rabbit_networking.erl"},{line,176}]},{rabbit_mqtt_sup,'-listener_specs/3-lc$^0/1-0-',3,[{file,"src/rabbit_mqtt_sup.erl"},{line,55}]},{rabbit_mqtt_sup,init,1,[{file,"src/rabbit_mqtt_sup.erl"},{line,47}]},{supervisor2,init,1,[{file,"src/supervisor2.erl"},{line,305}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,365}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,333}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},{rabbit_mqtt,start,[normal,[]]}}}}
You need to check the logs in /var/log/rabbitmq/startup_log or /var/log/rabbitmq/startup_err. It is very likely that your changes to the config file are causing the problem, and usually it is the syntax of the config file. If you are using the classic format, which has an array-like syntax, an extra or missing comma can also prevent the service from starting.
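Judging from the {1993} argument in the function_clause above, the tcp_listeners entry in the MQTT section appears to have been written as a one-element tuple instead of a plain port (or an {"interface", Port} pair). As a rough sketch, a classic-format rabbitmq.config with a working MQTT listener would look something like this (port 1993 taken from the stack trace; adjust to your setup):
%% /etc/rabbitmq/rabbitmq.config (classic Erlang-term format)
[
  {rabbitmq_mqtt, [
    %% tcp_listeners takes a list of plain ports or {"interface", Port} pairs;
    %% a one-element tuple such as {1993} does not match
    %% rabbit_networking:tcp_listener_addresses/1 and produces the boot failure above
    {tcp_listeners, [1993]}
  ]}
].
Remember that the file must end with the closing "]." and that every element except the last must be followed by a comma.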

NServiceBus - Service Pulse - Can't connect to ServiceControl (http://localhost:33333/api/)

I am using NServiceBus 5. My messages are sending / receiving correctly, but I'm having trouble with Service Pulse. I have configured the auditing using the default endpoint names.
When I navigate to Service Pulse (http://localhost:9090/) I get the following error.
Can't connect to ServiceControl (http://localhost:33333/api/)
Looking at my services I see that Particular ServiceControl is not started. When I attempt to start it, it starts and immediately stops.
I have checked the logs at:
%LOCALAPPDATA%\Particular\ServiceControl\logs
and
%WINDIR%\System32\config\systemprofile\AppData\Local\Particular\ServiceControl\logs
But apart from the errors about the missing queues from yesterday (see below) - nothing. When I attempt to restart the service now I get no errors.
Anyone know what I should do to get Service Pulse working correctly?
I deleted all my private queues yesterday thinking that they would be recreated automatically. Now I realise only the endpoint ones are recreated, so I have recreated some manually.
Right now along with my endpoint queues I have:
audit
auditqueue
error
error.log
particular.servicecontrol
particular.servicecontrol.errors
particular.servicecontrol.retries
particular.servicecontrol.timeouts
particular.servicecontrol.timeoutsdispatcher
--- EDIT ---
Ended up just uninstalling and reinstalling - fixed the problem.
ServiceControl --uninstall
ServiceControl --install
Try running ServiceControl --install in an admin console and it will create the queues (C:\Program Files (x86)\Particular Software\ServiceControl> .\ServiceControl.exe --install).
If not, you need to add these queues manually or reinstall ServiceControl (a sketch for creating them manually follows this list):
particular.servicecontrol
particular.servicecontrol.errors
particular.servicecontrol.timeouts
particular.servicecontrol.timeoutsdispatcher
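If you do create them by hand, a minimal PowerShell sketch (assuming the default MSMQ transport and that the queues should be transactional private queues, which is how ServiceControl normally creates them) would be something like this; you may still need to grant the ServiceControl service account permissions on them:
# run in an elevated Windows PowerShell session; names taken from the list above
Add-Type -AssemblyName System.Messaging
$queues = "particular.servicecontrol",
          "particular.servicecontrol.errors",
          "particular.servicecontrol.timeouts",
          "particular.servicecontrol.timeoutsdispatcher"
foreach ($q in $queues) {
    $path = ".\private`$\$q"
    if (-not [System.Messaging.MessageQueue]::Exists($path)) {
        # the second argument ($true) creates the queue as transactional
        [System.Messaging.MessageQueue]::Create($path, $true) | Out-Null
    }
}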

Could not connect to ActiveMQ Server - activemq for mcollective failing

We are continuously getting this error:
2014-11-06 07:05:34,460 [main ] INFO SharedFileLocker - Database activemq-data/localhost/KahaDB/lock is locked... waiting 10 seconds for the database to be unlocked. Reason: java.io.IOException: Failed to create directory 'activemq-data/localhost/KahaDB'
We have verified that activemq is running as the activemq user, and we have verified that the owner of the directories is activemq. It will not create the directories automatically, and if we create them ourselves, it still gives the same error. The service starts fine, but it just continuously spits out the same error. There is no lock file, as it will not generate any files or directories.
Another way to fix this problem, in one step, is to create the missing symbolic link in /usr/share/activemq/. The permissions are already set properly on /var/cache/activemq/data/, but it seems the activemq RPM is not creating the symbolic link to that location as it should. The symbolic link should be as follows: /usr/share/activemq/activemq-data -> /var/cache/activemq/data/. After creating the symbolic link, restart the activemq service and the issue will be resolved.
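Concretely, with the paths exactly as described above (run as root):
# recreate the symlink the RPM should have installed, then restart the broker
ln -s /var/cache/activemq/data /usr/share/activemq/activemq-data
service activemq restart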
I was able to resolve this by the following:
ensure activemq is the owner of and has access to /var/log/activemq and all subdirectories.
ensure /etc/init.d/activemq has: ACTIVEMQ_CONFIGS="/etc/sysconfig/activemq"
create the file activemq in /etc/sysconfig if it doesn't exist.
add this line: ACTIVEMQ_DATA="/var/log/activemq/activemq-data/localhost/KahaDB"
The problem was that ActiveMQ 5.9.x was using /usr/share/activemq as its KahaDB location.
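Putting steps 3 and 4 together, /etc/sysconfig/activemq ends up containing a single line like the following (the path is the one from the steps above; point it wherever you actually want KahaDB to live):
# /etc/sysconfig/activemq -- sourced by /etc/init.d/activemq via ACTIVEMQ_CONFIGS
ACTIVEMQ_DATA="/var/log/activemq/activemq-data/localhost/KahaDB"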

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue as to what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the ZooKeeper node in question, the one holding the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds",
you will need to restart the Solr hosts as well to resolve the problem.
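To check for the orphaned node described above, you can inspect the election znodes with the ZooKeeper CLI; this is only a sketch, and the host, port and node name are placeholders:
# connect to the ZooKeeper ensemble
./zkCli.sh -server zk-host:2181
# list the election nodes and the overseer queue that tends to clog up
ls /overseer_elect/election
ls /overseer/queue
# the stat output includes ephemeralOwner; compare it against the live sessions
# (e.g. via the "cons"/"dump" four-letter commands) to spot a node whose session is gone
stat /overseer_elect/election/<node-name>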
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point two services were started. Because the second service cannot actually start (they can't both listen on the same port), it seems to just sit there hanging in a failed state, or it fails to start the process but messes up the other Solr processes somehow, possibly by overwriting temporary clusterstate files locally.
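A minimal sketch of the relevant part of a Solr systemd unit under that assumption; the unit name, paths and user are illustrative rather than taken from the question:
# /etc/systemd/system/solr.service (fragment)
[Service]
# the stock Solr start script daemonizes, so systemd must be told to expect a fork
Type=forking
# must match the pid file the start script actually writes, otherwise systemd loses
# track of the process and may start a second instance
PIDFile=/var/solr/solr-8983.pid
ExecStart=/opt/solr/bin/solr start
ExecStop=/opt/solr/bin/solr stop -all
User=solr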
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this ZK node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of solr (or java if you know it's safe), and restart the solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had two running processes; for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.
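If you suspect the same situation, a quick way to confirm it is to look for two JVMs serving Solr and compare their start times (a sketch; the grep pattern depends on how your instances are launched):
# look for more than one Solr JVM; note the start time (STIME) column of each
ps -ef | grep -i '[s]olr'
# kill the older of the two by PID, then restart the service cleanly
kill <older-pid>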