CloudStack fails to start primary and secondary storage

I use two hosts to build my CloudStack cluster. Both hosts run Ubuntu 12.04 with NFSv3, and I use host1 as both the primary and secondary storage server. The management server is also on host1. I can mount host1's primary and secondary storage on host2, and I can access them through host2's mount points. But when I try to add my first zone, I run into the problem shown below; these messages can be found in my management-server.log:
2013-10-26 04:11:47,086 INFO [storage.secondary.SecondaryStorageManagerImpl] (secstorage-1:null) Unable to start secondary storage vm for standby capacity, secStorageVm vm Id : 28, will recycle it and start a new one
2013-10-26 04:11:47,086 INFO [cloud.secstorage.PremiumSecondaryStorageManagerImpl] (secstorage-1:null) Primary secondary storage is not even started, wait until next turn
2013-10-26 04:57:16,615 WARN [storage.secondary.SecondaryStorageManagerImpl] (secstorage-1:null) Exception while trying to start secondary storage vm
com.cloud.exception.AgentUnavailableException: Resource [Host:1] is unreachable: Host 1: Unable to start instance due to com.cloud.agent.api.Answer cannot be cast to com.cloud.agent.api.storage.PrimaryStorageDownloadAnswer
The log suggests that CloudStack fails to reach the host and to start the primary and secondary storage. I just don't know how to test whether the host is really unreachable, or how to get the primary and secondary storage started.
This directly results in the two system VMs being stopped. I would like some clues about how this problem occurs and some methods to debug it. Any help will be appreciated!

Check that you have seeded secondary storage with the system VM template.
See "Prepare the System VM Template" in the CloudStack 4.2 Install Guide.
The secondary storage system VM is responsible for adding templates to secondary storage. Without it, CloudStack cannot add new templates, which is why the system VM template itself has to be seeded manually.
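For reference, seeding typically looks something like this on the management server (the script path, template URL and hypervisor type below are examples from the 4.2-era install guide, so adjust them for your setup):

# mount the secondary storage export, e.g. the one served from host1
mount -t nfs host1:/export/secondary /mnt/secondary
# seed the system VM template: -m is the mount point, -u the template URL
# for your hypervisor, -h the hypervisor type, -F overwrites an old copy
/usr/share/cloudstack-common/scripts/storage/secondary/cloud-install-sys-tmplt \
  -m /mnt/secondary \
  -u http://download.cloud.com/templates/4.2/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 \
  -h kvm -F
umount /mnt/secondary

Once the template is in place, restart the management server and the SSVM should be able to start.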

Related

Naming rabbitmq node with a preconfigured name

I am setting up a RabbitMQ single-node container built from a Docker image. The image is configured to persist to an NFS-mounted disk.
I ran into an issue when the image is restarted: every time the node restarts it gets a unique name, and the restarted node then searches for the old node names it reads from the cluster_nodes.config file.
Error dump shows:
Error during startup: {error,
{failed_to_cluster_with,
[rabbit#9c3bfb851ba3],
"Mnesia could not connect to any nodes."}}
How can I configure my image to use the same node name each time it is restarted, instead of the arbitrary node name given by the Kubernetes cluster?

Mule ESB: Is it possible to start 2 instances of the Mule ESB

I created two separate directories in which I installed the standalone Mule ESB server:
/ee/mmc-distribution-mule-console-bundle-3.5.2-HF1
/ee2/mmc-distribution-mule-console-bundle-3.5.2-HF1
I start up the first server, and below is the status:
[root#x240perf2 mmc-distribution-mule-console-bundle-3.5.2-HF1]# ./status.sh
MMC is running as PID=1998.
Mule Enterprise Edition is running as PID=2619.
Then I try to start the second instance:
[root#x240perf2 mmc-distribution-mule-console-bundle-3.5.2-HF1]# ./startup.sh
Port 8585 is in use, please make it available and try again.
So apparently port 8585 is being used by the original instance.
So I stopped the first instance and started the second instance, which came up successfully, as follows:
./startup.sh
Please enter the desired port for Mule [Default 7777]:
Starting MMC, please wait...
class com.sun.jersey.multipart.impl.MultiPartConfigProvider
class com.sun.jersey.multipart.impl.MultiPartReader
class com.sun.jersey.multipart.impl.MultiPartWriter
[11-13 16:49:19] WARN HttpSessionSecurityContextRepository [http-bio-8585-exec-1]: Failed to create a session, as response has been committed. Unable to store SecurityContext.
[11-13 16:49:32] WARN HttpMethodBase [http-bio-8585-exec-12]: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
[11-13 16:49:38] WARN HttpSessionSecurityContextRepository [http-bio-8585-exec-12]: Failed to create a session, as response has been committed. Unable to store SecurityContext.
Nov 13, 2014 4:49:50 PM org.apache.catalina.core.StandardServer await
INFO: A valid shutdown command was received via the shutdown port. Stopping the Server instance.
Nov 13, 2014 4:49:50 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler ["http-bio-8585"]
But notice it seems to be using 8585 for Tomcat (which I know little about, except that it is some sort of app server; I've never used it).
I examined this site:
http://www.mulesoft.org/documentation/display/33X/Running+Multiple+Mule+Instances
but it does not discuss the issue, and the page it points to does not seem current. Did I misunderstand something?
Is it possible to run two separate instances of Mule ESB at the same time,
and if so, how? (How would I change the port it is using, and what file should I modify?)
Thanks
Edit: my second post, in response to the answer:
(BTW: I am using Mule ESB standalone Enterprise Edition 3.5.2)
To make sure I did not have any apps running
on port 8585, I shut down my original instance, created two new instances, and made sure no apps were deployed to either instance.
I brought up the first instance without issue, but the second instance still gives me the "port 8585 in use" error (from startup.sh).
This site says that the MMC default port is 7777, but the Tomcat port on which it runs by default is 8585:
http://www.mulesoft.org/documentation/display/current/Setting+Up+MMC-Mule+ESB+Communications
I used the following command to find all references to port 8585 within my second instance:
find . -type f | xargs grep "8585"
Other than log files I got two hits
startup.sh
and
/mmc-3.5.2-HF1/apache-tomcat-7.0.52/conf/server.xml
I did NOT find $MULE_HOME/apps/mmc/mule-config.xml in either instance (probably because I have no apps deployed).
In server.xml, the MMC apparently uses Tomcat to
handle the MMC application, and server.xml contains
the following:
<Connector port="8585" protocol="HTTP/1.1"
So I guess I could change 8585 to 8586 at this point, but ...
The startup.sh has several (about 9 or 10) hardcoded references to 8585, used to check whether the MMC is running and take action accordingly.
So do I actually have to change startup.sh in the second instance to replace 8585 with 8586, as well as change the server.xml port 8585 reference?
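For reference, the change I'm contemplating would look something like this (a sketch, assuming the relative paths found above and that 8586 is free on this box):

cd /ee2/mmc-distribution-mule-console-bundle-3.5.2-HF1
# keep backups, then swap the hardcoded port in the startup script and the Tomcat connector
cp startup.sh startup.sh.bak
cp mmc-3.5.2-HF1/apache-tomcat-7.0.52/conf/server.xml mmc-3.5.2-HF1/apache-tomcat-7.0.52/conf/server.xml.bak
sed -i 's/8585/8586/g' startup.sh
sed -i 's/8585/8586/g' mmc-3.5.2-HF1/apache-tomcat-7.0.52/conf/server.xml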
Thanks
You can run as many instances as you want, as long as they don't use the same ports. It looks like you are deploying something on port 8585, so in the second instance you have to select a different port.
Is that port being used in any application that you developed and deployed in the Mule runtime?
Also, if you are using the Mule runtime with the MMC agent activated, you have to change the port for the agent in the second instance as well. I think you can do that in /conf/wrapper.conf or by passing the following parameter to the startup script:
-Dmule.mmc.bind.port=7778
(or any port that is free).
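A sketch of both options for the second instance (the wrapper.java.additional index and the port are only examples; use any free ones):

# option 1: pass the property to the startup script
./startup.sh -Dmule.mmc.bind.port=7778

# option 2: add it to the wrapper configuration (conf/wrapper.conf),
# using an unused wrapper.java.additional.<n> slot
wrapper.java.additional.15=-Dmule.mmc.bind.port=7778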
You can run as many as you want.
In MMC you can deploy and run many applications; each application has its own instance.

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host, with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue as to what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by an /overseer/queue directory that is clogged up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the ZooKeeper node that holds the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds",
you will need to restart the Solr hosts as well to resolve the problem.
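A sketch of how one might confirm the orphan first (paths, host names and ports are examples):

# list the election entries and note each one's ephemeralOwner session id
/opt/zookeeper/bin/zkCli.sh -server zk1:2181
ls /overseer_elect/election
stat /overseer_elect/election/<one-of-the-entries>
# dump live sessions and their ephemeral nodes (works on the ZooKeeper leader);
# an election entry whose session id is not listed here is the orphan
echo dump | nc zk1 2181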
Condition 2
Cause: a misconfigured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point 2 services were started. Because the second service cannot actually start (both can't listen on the same port), it either sits there hanging in a failed state or fails to start the process but still messes up the other Solr process, possibly by overwriting temporary clusterstate files locally.
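A minimal sketch of the relevant unit settings, assuming a bin/solr-style install under /opt/solr (paths, port and pid file location will differ per install):

[Service]
Type=forking
# bin/solr start forks and writes a pid file; point systemd at it so it
# tracks the real Solr process instead of thinking the service died
PIDFile=/var/solr/solr-8983.pid
ExecStart=/opt/solr/bin/solr start
ExecStop=/opt/solr/bin/solr stop -p 8983
User=solr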
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader; normally this ZK node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But this node was completely missing on the cluster where duplicate Solr instances were attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe) and restart the Solr service.
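A sketch of checking for strays before restarting (the service name and the <stray-pid> placeholder are assumptions):

# anything listed here more than once per node is suspect
ps -ef | grep -i '[s]olr'
sudo systemctl stop solr
ps -ef | grep -i '[s]olr'   # a process still listed here is a stray
sudo kill <stray-pid>       # escalate to kill -9 only if it refuses to exit
sudo systemctl start solr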
We figured it out!
The issue was that Jetty didn't really stop, so we had 2 running processes; for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.

Amazon EC2 || RHEL || Connection refused on port 22 after reboot

I am aware that this question has been asked many times in forums and I have tried all the solutions mentioned in them, but no luck.
Actually, I suspect that the last time I was trying to replace /etc/sysconfig/iptables with my own iptables rules, I mistakenly replaced /etc/init.d/iptables and restarted the machine. As expected, it didn't start. So I detached the EBS volume from this instance, attached it to a new RHEL instance, and fixed the mess by copying back /etc/init.d/iptables from a backup (I used to take backups before replacements :) ), and did the same for /etc/sysconfig/iptables.
I have also put some custom startup scripts in the /etc/init.d folder so that our application starts on instance reboot. I have removed those too, to make sure none of my scripts are causing this. But the system is still not allowing me to connect via SSH. The AWS console shows 2/2 checks successful, but I am not able to connect on port 22.
Here are the last few lines of the system log, which suggest that something goes wrong during or after iptables startup, but do not show what. :(
blkfront: xvde1: barriers disabled
Changing capacity of (202, 65) to 62914560 sectors
xvde1: detected capacity change from 0 to 32212254720
EXT4-fs (xvde1): mounted filesystem with ordered data mode. Opts:
dracut: Mounted root filesystem /dev/xvde1
dracut: Loading SELinux policy
type=1404 audit(1398404320.826:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
type=1403 audit(1398404321.795:3): policy loaded auid=4294967295 ses=4294967295
dracut:
dracut: Switching root
udev: starting version 147
Initialising Xen virtual ethernet driver.
microcode: CPU0 sig=0x306e4, pf=0x1, revision=0x415
platform microcode: firmware: requesting intel-ucode/06-3e-04
Microcode Update Driver: v2.00 <tigran#aivazian.fsnet.co.uk>, Peter Oruba
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
ip6_tables: (C) 2000-2006 Netfilter Core Team
nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
ip_tables: (C) 2000-2006 Netfilter Core Team
Can anyone help me in identifying what is going wrong here?
Got it fixed.
Actually, it was not a problem with iptables. It was due to the known bug in RHEL 6.4 on EC2 that puts wrong entries in the sshd_config file. Although I had checked this file for wrong entries in my first attempt to resolve the issue, somehow the bad entries were being created again, maybe because every time I start a new machine I use either my own AMI or the stock RHEL 6.4 AMI. In both cases the AMI is still registered as 6.4, even though the OS on the disk has been updated to 6.5; maybe this is why the wrong entries kept appearing in sshd_config. I fixed the file again, created a new AMI using RHEL 6.5, attached the EBS volume from the instance created with my RHEL 6.4 AMI, and now it works fine.
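For anyone hitting the same thing, a sketch of inspecting and validating sshd_config from a rescue instance (the device name and mount point are examples; your attachment point may differ):

# after attaching the broken root volume to a healthy instance as /dev/xvdf
sudo mkdir -p /mnt/rescue
sudo mount /dev/xvdf1 /mnt/rescue
# eyeball the config, then let sshd validate it without starting a daemon
sudo less /mnt/rescue/etc/ssh/sshd_config
sudo /usr/sbin/sshd -t -f /mnt/rescue/etc/ssh/sshd_config
# fix anything it flags, unmount, and reattach the volume to the original instance
sudo umount /mnt/rescue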

Brisk TaskTracker not starting in a multi-node Brisk setup

I have a 3-node Brisk cluster (Brisk v1.0_beta2). Cassandra is working fine (all three nodes see each other and data is balanced across the ring). I started the nodes with the brisk cassandra -t command. I cannot, however, run any Hive or Pig jobs. When I do, I get an exception saying that it cannot connect to the task tracker.
During the startup process, I see the following in the log:
TaskTracker.java (line 695) TaskTracker up at: localhost.localdomain/127.0.0.1:34928
A few lines later, however, I see this:
Retrying connect to server: localhost.localdomain/127.0.0.1:8012. Already tried 9 time(s).
INFO [TASK-TRACKER-INIT] RPC.java (line 321) Server at localhost.localdomain/127.0.0.1:8012 not available yet, Zzzzz...
Those lines are repeated non-stop as long as my cluster is running.
My cassandra.yaml file specifies the box IP (not 0.0.0.0 or localhost) as the listen_address, and the rpc_address is set to 0.0.0.0.
Why is the client attempting to connect to a different port than the log shows the task tracker as using? Is there anywhere these addresses/ports can be specified?
I figured this out. In case anyone else has the same issue, here's what was going on:
Brisk uses the first entry in the Cassandra cluster's seed list to pick the initial jobtracker. One of my nodes had 127.0.0.1 in its seed list. This worked for the Cassandra setup, since all the other nodes in the cluster connected to that box to get the cluster topology, but it didn't work for the jobtracker selection.
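The fix was to put the box's real IP, not 127.0.0.1, as the first seed on every node. A sketch of the relevant cassandra.yaml section (the IPs are placeholders, and the exact seed_provider layout depends on your Cassandra/Brisk version):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # the first seed is what Brisk uses to pick the initial jobtracker,
          # so it must be a reachable address, not a loopback
          - seeds: "192.168.1.101,192.168.1.102,192.168.1.103"

Restart the nodes after the change so they pick up the new seed list.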
Looks like your jobtracker isn't running. What do you see when you run "brisktool jobtracker"?