Not able to delete RabbitMQ cluster on Google Cloud

I created a RabbitMQ cluster via "click-to-deploy" on Google Cloud.
I tried deleting that cluster, but the deletion failed. I then manually deleted the load balancers and VM instances that had been created, but the cluster is still stuck in a half-deleted state.
Message received
-----------------
Your RabbitMQ cluster failed to delete
Jul 4, 2014, 9:58:45 PM
queueRamNodes: DELETING
queueDiskNodes: DELETING
rabbitMq-Queue-Nodes: DELETE_FAILED
Error deleting Load Balancer : Required 'WRITER' permission for 'projects/<projectname>'
rabbitMq-WebManagement: DELETE_FAILED
Error deleting HttpHealthCheck : Required 'WRITER' permission for 'projects/<projectname>'
statsNode: DELETING
rabbitMq-All-Nodes: DELETE_FAILED
Error deleting Load Balancer : Required 'WRITER' permission for 'projects/<projectname>'
Any help would be appreciated.

It seems that the health check (rabbitmq-webmanagement-00131) may be causing the cluster deletion to fail.
Try deleting all rabbitmq-* target pools, forwarding rules, and health checks using the Google Developers Console -> Compute -> Compute Engine -> Network load balancing. Then try deleting the cluster again; hopefully it will succeed.
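If you prefer the command line, the leftover resources can also be removed with gcloud. A rough sketch (the resource names and region below are guesses based on the error messages, so list first and substitute what you actually see):
gcloud compute forwarding-rules list                                      # find the rabbitmq-* forwarding rules and their region
gcloud compute forwarding-rules delete rabbitmq-all-nodes --region us-central1
gcloud compute target-pools delete rabbitmq-all-nodes --region us-central1
gcloud compute http-health-checks delete rabbitmq-webmanagement-00131
Once these are gone, retrying the cluster deletion should no longer hit the DELETE_FAILED errors on the load balancer and health check.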

Related

AWS DMS FATAL_ERROR Error with replicate-ongoing-changes only

I'm trying to migrate data from Aurora MySQL to S3. Since Aurora MySQL does not support replicating ongoing changes from the cluster reader endpoint, my source endpoint is attached to the cluster writer endpoint.
When I choose full-load migration only, DMS works. However, when I choose full-load + ongoing replication, or ongoing replication only, I get the error: Last Error Task 'courral-membership-s3-writer' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL.
Thanks in advance.
This could be caused by the replication instance class; you may need to upgrade it.
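If the instance class does turn out to be the bottleneck, it can be resized from the CLI. A rough sketch (the ARN and target class here are placeholders, not values from the question):
aws dms modify-replication-instance \
    --replication-instance-arn <replication-instance-arn> \
    --replication-instance-class dms.c5.xlarge \
    --apply-immediately
The task is interrupted while the instance is resized, so it is worth checking the replication instance's CloudWatch metrics (CPU, freeable memory, swap) first to confirm it is actually under-provisioned.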

AWS EKS node group migration stopped sending logs to Kibana

I've encountered a problem while using EKS with fluent-bit, and I would be grateful for the community's help. First I'll describe the cluster.
We are running an EKS cluster in a VPC that had an unmanaged node group.
The EKS cluster's network configuration is marked as "public and private", and
we use fluent-bit with the Elasticsearch service to show logs in Kibana.
We decided to move to a managed node group in that cluster, and we successfully migrated from the unmanaged node group to a managed node group.
Since the migration we cannot see any logs in Kibana; when getting the logs manually from the fluent-bit pods, there are no errors.
I toggled debug-level logging for fluent-bit to get a better look at it.
I can see that fluent-bit gathers all the log files, and then I see messages like:
[debug] [out_es] HTTP Status=403 URI=/_bulk
[debug] [retry] re-using retry for task_id=63 attemps=3
[debug] [sched] retry=0x7ff56260a8e8 63 in 321 seconds
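For reference, the fluent-bit pod logs mentioned above can be pulled with something like the following (the namespace and label selector depend on how fluent-bit is deployed):
kubectl logs -n logging -l k8s-app=fluent-bit --tail=200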
Furthermore, we have managed node groups in other EKS clusters, but we did not migrate to them; they were created with managed node groups from the start.
The new managed node group was created from the same template we use for a working managed node group; the only difference is the compute power.
The template has nothing special in it except auto scaling.
I compared the node group IAM role of the working node group with that of my non-working node group, and the roles seem to be the same.
As for my fluent-bit configuration, I have the same configuration in a few EKS clusters and it works, so I don't think that is the root cause; but if anyone thinks otherwise, I can add it on request.
Has anyone had this kind of problem? Why would a node group migration cause such an issue?
Thanks in advance!
Lesson learned: always look at the access policy of the resource you are having an issue with; it may not match your node group's role. In this case, the HTTP 403 from the _bulk endpoint suggests the Elasticsearch domain's access policy does not include the new managed node group's IAM role.
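For example, with an Amazon Elasticsearch Service domain the access policy can be inspected and updated from the CLI; a sketch, with the domain name and policy file as placeholders:
aws es describe-elasticsearch-domain-config --domain-name <domain-name>      # inspect the current AccessPolicies
aws es update-elasticsearch-domain-config --domain-name <domain-name> \
    --access-policies file://access-policy.json                              # policy JSON that allows the managed node group's role ARN as a principal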

Unstable cluster with hosts.xml resets to default

While performing Management API operations such as removing app servers from a MarkLogic cluster, the cluster becomes unstable and hosts.xml resets to its default/localhost settings.
Logs show something like:
MarkLogic: Slow send xx.xx.34.113:57692-xx.xx.34.170:7999, 4.605 KB in 1.529 sec; check host xxxx
Whether or not the infrastructure is slow, automatic recovery is still not happening.
How can we overcome this situation?
Can anyone provide more info on how the Management API works under the hood?
Adding further details:
DELETE http://${BootstrapHost}:8002${serveruri}
POST http://${BootstrapHost}:8002${forest}?state=detach
DELETE http://${BootstrapHost}:8002${forest}?replicas=delete&level=full
DELETE http://${BootstrapHost}:8001/admin/v1/host-config?remote-host=${hostname}
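For context, these are plain REST calls; with curl they would look roughly like this (digest authentication with admin credentials is assumed here):
curl --anyauth -u admin:password -X DELETE "http://${BootstrapHost}:8002${serveruri}"
curl --anyauth -u admin:password -X POST "http://${BootstrapHost}:8002${forest}?state=detach"
curl --anyauth -u admin:password -X DELETE "http://${BootstrapHost}:8002${forest}?replicas=delete&level=full"
curl --anyauth -u admin:password -X DELETE "http://${BootstrapHost}:8001/admin/v1/host-config?remote-host=${hostname}"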
When removing servers (the 1st request) or removing a host (the 4th request), a few nodes in the cluster restart and we check for node availability. However, this is unpredictable, and sometimes hosts.xml resets to a default XML that says the host is not part of any cluster.
Our fix: we copy hosts.xml from another host to the faulty host, and it starts working again.
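Roughly (the default MarkLogic data directory is assumed; host names are placeholders):
scp /var/opt/MarkLogic/hosts.xml admin@faulty-host:/tmp/hosts.xml                                              # run from a healthy host
ssh admin@faulty-host 'sudo cp /tmp/hosts.xml /var/opt/MarkLogic/hosts.xml && sudo service MarkLogic restart'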
We found this was much less likely to happen with MarkLogic 8, but with MarkLogic 9 the problem is frequent, and on AWS it is even more frequent.

Could not connect to ActiveMQ Server - activemq for mcollective failing

We are continuously getting this error:
2014-11-06 07:05:34,460 [main ] INFO SharedFileLocker - Database activemq-data/localhost/KahaDB/lock is locked... waiting 10 seconds for the database to be unlocked. Reason: java.io.IOException: Failed to create directory 'activemq-data/localhost/KahaDB'
We have verified that activemq is running as the activemq user, and that the owner of the directories is activemq. It will not create the directories automatically, and if we create them ourselves, it still gives the same error. The service starts fine, but it just continuously spits out the same error. There is no lock file, as it will not generate any files or directories.
Another way to fix this problem, in one step, is to create the missing symbolic link in /usr/share/activemq/. The permissions are already set properly on /var/cache/activemq/data/, but it seems the activemq RPM is not creating the symbolic link to that location as it should. The symbolic link should be as follows: /usr/share/activemq/activemq-data -> /var/cache/activemq/data/. After creating the symbolic link, restart the activemq service and the issue will be resolved.
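In shell terms, something like this (assuming those default RPM paths):
ln -s /var/cache/activemq/data /usr/share/activemq/activemq-data    # creates /usr/share/activemq/activemq-data -> /var/cache/activemq/data
service activemq restart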
I was able to resolve this with the following steps (a rough command sketch follows below):
ensure activemq owns and has access to /var/log/activemq and all subdirectories.
ensure /etc/init.d/activemq has: ACTIVEMQ_CONFIGS="/etc/sysconfig/activemq"
create the file activemq in /etc/sysconfig if it doesn't exist.
add this line: ACTIVEMQ_DATA="/var/log/activemq/activemq-data/localhost/KahaDB"
The problem was that ActiveMQ 5.9.x was using /usr/share/activemq as its KahaDB location.
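Put together, the steps above look roughly like this (paths as listed; adjust to your layout):
chown -R activemq:activemq /var/log/activemq
grep ACTIVEMQ_CONFIGS /etc/init.d/activemq                                                            # expect ACTIVEMQ_CONFIGS="/etc/sysconfig/activemq"
echo 'ACTIVEMQ_DATA="/var/log/activemq/activemq-data/localhost/KahaDB"' >> /etc/sysconfig/activemq    # creates the file if it is missing
service activemq restart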

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looks fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host, with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Could it be?)
Any clue what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the ZooKeeper node in question, the one holding the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds",
you will need to restart the Solr hosts as well to resolve the problem.
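One way to spot the orphan from the ZooKeeper CLI (the host/port and znode name are placeholders):
./zkCli.sh -server zk-host:2181
ls /overseer_elect/election                      # look for the server listed more often than the others
stat /overseer_elect/election/<suspect-znode>    # check whether its ephemeralOwner session still exists
ls /overseer/queue                               # a very long listing here is the clogged queue described above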
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly; it thought the service was dead, but it wasn't, and at some point 2 services were started. Because the 2nd service cannot start (they can't both listen on the same port), it either just sits there hanging in a failed state, or fails to start the process but messes up the other Solr processes somehow, possibly by overwriting temporary clusterstate files locally.
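A minimal sketch of the relevant settings as a drop-in override (the unit name, PID file path, and layout here are assumptions about a typical Solr service install, not the OP's exact setup):
sudo mkdir -p /etc/systemd/system/solr.service.d
sudo tee /etc/systemd/system/solr.service.d/override.conf <<'EOF'
[Service]
Type=forking
PIDFile=/var/solr/solr-8983.pid
EOF
sudo systemctl daemon-reload
sudo systemctl restart solr
With Type=forking and a correct PIDFile, systemd tracks the real Solr PID and will not try to start a second copy while the first is still running.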
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe) and restart the solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had 2 running processes; for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.
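Roughly (the grep pattern depends on how Jetty/Solr is launched):
ps -eo pid,lstart,cmd | grep '[j]etty'    # list the duplicate processes with their start times
kill <older-pid>                          # terminate the stale one; fall back to kill -9 only if it ignores SIGTERM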