Error running topology in production cluster with Apache Storm 1.0.0, topology does not start - properties

I have a topology that runs well on a local cluster.
But when I try to run it on a production cluster, the following happens:
The nimbus is up
The storm UI is up
The two workers I use are up
ZooKeeper is up
I run storm with
storm jar myjar.jar MyClass
Nimbus submits the topology
The topology and the workers appear in the Storm UI
BUT:
The topology does not start despite the fact that its status is ACTIVE
The log file of the topology does not appear on the workers.
I see the following in supervisor.log on the worker:
2016-04-15 13:18:19.831 o.a.s.d.supervisor [WARN] There was a connection problem with nimbus. #error {
:cause jobs-rec-storm-nimbus
:via
[{:type java.lang.RuntimeException
:message org.apache.storm.thrift.transport.TTransportException: java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.security.auth.TBackoffConnect retryNext TBackoffConnect.java 64]}
{:type org.apache.storm.thrift.transport.TTransportException
:message java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.thrift.transport.TSocket open TSocket.java 226]}
{:type java.net.UnknownHostException
:message jobs-rec-storm-nimbus
:at [java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]}]
:trace
[[java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]
[java.net.SocksSocketImpl connect SocksSocketImpl.java 392]
[java.net.Socket connect Socket.java 589]
[org.apache.storm.thrift.transport.TSocket open TSocket.java 221]
[org.apache.storm.thrift.transport.TFramedTransport open TFramedTransport.java 81]
[org.apache.storm.security.auth.SimpleTransportPlugin connect SimpleTransportPlugin.java 103]
[org.apache.storm.security.auth.TBackoffConnect doConnectWithRetry TBackoffConnect.java 53]
[org.apache.storm.security.auth.ThriftClient reconnect ThriftClient.java 99]
[org.apache.storm.security.auth.ThriftClient <init> ThriftClient.java 69]
[org.apache.storm.utils.NimbusClient <init> NimbusClient.java 106]
[org.apache.storm.utils.NimbusClient getConfiguredClientAs NimbusClient.java 78]
[org.apache.storm.utils.NimbusClient getConfiguredClient NimbusClient.java 41]
[org.apache.storm.blobstore.NimbusBlobStore prepare NimbusBlobStore.java 268]
[org.apache.storm.utils.Utils getClientBlobStoreForSupervisor Utils.java 462]
[org.apache.storm.daemon.supervisor$fn__9590 invoke supervisor.clj 942]
[clojure.lang.MultiFn invoke MultiFn.java 243]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351$fn__9369 invoke supervisor.clj 582]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351 invoke supervisor.clj 581]
[org.apache.storm.event$event_manager$fn__8903 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-04-15 13:18:19.831 o.a.s.d.supervisor [INFO] Finished downloading code for storm id jobs-KafkaMigration-topology-3-1460740616
2016-04-15 13:18:19.850 o.a.s.d.supervisor [INFO] Missing topology storm code, so can't launch worker with assignment ...(some more numbers)
So I assume that I have a connection problem with Nimbus, but the configuration file on the worker is:
storm.zookeeper.servers:
- "192.168.22.209"
- "192.168.22.216"
- "192.168.22.217"
storm.local.dir: "/app/home/storm"
storm.zookeeper.root: "/storm-prod"
#
nimbus.seeds: ["192.168.120.96"]
And if I ping the Nimbus IP from the workers, it responds OK.
Where is the error, and how can I fix it?
Thanks!

What appears to happen in this context is that the Storm supervisor resolves Nimbus from whatever is configured in the storm.yaml seeds/host the first time, and from then on uses the Nimbus host name to download the topology artifacts.
If that is correct, DNS is mandatory for a cluster setup. This is far from ideal, especially when using containers in an orchestrated environment like Kubernetes.
The current workaround I'm using is adding
storm.local.hostname: "<local.ip.value>"
to the storm.yaml
Thanks to #bastien, who provided the tip on the Storm user mailing list.
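For illustration, a minimal supervisor-side storm.yaml with this workaround in place might look like the sketch below (the addresses are the example values from the question; adjust them to your own cluster):
storm.zookeeper.servers:
    - "192.168.22.209"
    - "192.168.22.216"
    - "192.168.22.217"
storm.local.dir: "/app/home/storm"
nimbus.seeds: ["192.168.120.96"]
# workaround: pin the address this node uses/reports locally
storm.local.hostname: "192.168.22.216"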

I ran into a similar issue. It turned out my firewall rules were blocking the supervisor ports. Make sure the supervisor and Nimbus are able to talk to each other.

I found that the hostnames of the boxes need to match what I was calling them in the /etc/hosts file.
In the hosts file I had
xxx.xxx.xxx.xxx nimbus
but the hostname on the box was different, and it was pulling the hostname from the OS.
Changing the hostname on the OS of the Nimbus server resolved my issue.
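As a rough sketch of how to check for that mismatch (the hostname and commands are examples, and hostnamectl assumes a systemd-based distro):
# on the Nimbus box: compare the OS hostname with what /etc/hosts calls it
hostname
grep nimbus /etc/hosts
# if they differ, either rename the box to match the hosts entry...
sudo hostnamectl set-hostname nimbus
# ...or fix the /etc/hosts entries on all nodes to use the OS hostname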

Related

Kafka-s3-connect killed instantly after start

I want to connect AWS Kafka (MSK) with S3 using the Confluent connector on my EC2 server. I tried to configure everything as in the tutorials. When I run connect-standalone or connect-distributed, everything goes well at first and I don't get any errors in the logs, but right after the information about the connection starting, my connector dies instantly without any further output. Has anybody had the same problem?
config/connect-standalone.properties
bootstrap.servers=msk-connection-string
plugin.path=/home/ubuntu/connectors/confluentinc-kafka-connect-s3
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
offset.storage.file.filename=/tmp/connect.offsets
connector.properties
connector.class=io.confluent.connect.s3.S3SinkConnector
format.class=io.confluent.connect.s3.format.bytearray.ByteArrayFormat
flush.size=1
topics=SomeTopic
s3.bucket.name=bucket-name-here
s3.region=us-west-2
s3.part.size=5242880
aws.access.key.id=****
aws.secret.access.key=****
behavior.on.null.values=ignore
storage.class=io.confluent.connect.s3.storage.S3Storage
topics.dir=../topics
store.url=http://bucket-name.s3-website-Region.amazonaws.com
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
logs:
[2021-08-20 06:32:35,954] INFO Kafka version: 2.7.0 (org.apache.kafka.common.utils.AppInfoParser:119)
[2021-08-20 06:32:35,954] INFO Kafka commitId: 448719dc99a19793 (org.apache.kafka.common.utils.AppInfoParser:120)
[2021-08-20 06:32:35,954] INFO Kafka startTimeMs: 1629441155953 (org.apache.kafka.common.utils.AppInfoParser:121)
Killed
Please help!
MSK requires a TLS connection.
After adding a few lines of SSL configuration to config/connect-standalone.properties:
producer.security.protocol=SSL
consumer.security.protocol=SSL
security.protocol=SSL
ssl.protocol=TLS
ssl.truststore.location=/your/path/to/truststore/kafka.client.truststore.jks
It starts working properly!
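For reference, the standalone worker is then started with the two property files shown above, roughly like this (the exact script name and path depend on whether you use the Apache Kafka or Confluent distribution):
bin/connect-standalone.sh config/connect-standalone.properties connector.properties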

Open MQ broker won't start

We're running Glassfish 4.1.1 (Payara) with MQ 5.1.1. It's an HA setup with a load balancer and a cluster.
Glassfish is running OK. The problem is that MQ won't start.
I think that a remote MQ is starting. I can do imqcmd list bkr -b and I get successful results.
However, when I do imqcmd list bkr (or imqcmd list jmx, without -b hostname) I get:
Host Primary Port
-------------------------
localhost 7676
WARNING: [C4003]: Error occurred on connection creation [localhost:7676]. - cause: java.net.SocketException: Connection reset
Error while connecting to the broker on host 'localhost' and port '7676'.
I'd like to get rid of the error and see my network IP instead of localhost.
Also GF server.log gives this:
[2017-04-12T11:54:46.516-0400] [Payara 4.1] [SEVERE] [rardeployment.start_failed] [javax.enterprise.resource.resourceadapter.com.sun.enterprise.connectors] [tid: _ThreadID=42 _ThreadName=admin-listener(2)] [timeMillis: 1492012486516] [levelValue: 1000] [[
RAR6035 : Resource adapter start failed.
javax.resource.spi.ResourceAdapterInternalException: java.security.PrivilegedActionException: javax.resource.spi.ResourceAdapterInternalException: MQJMSRA_RA4001: start:Aborting:Exception starting EMBEDDED broker=Broker failed to start
at com.sun.enterprise.connectors.jms.system.ActiveJmsResourceAdapter.startResourceAdapter(ActiveJmsResourceAdapter.java:557)
at com.sun.enterprise.connectors.ActiveOutboundResourceAdapter.init(ActiveOutboundResourceAdapter.java:130)
...
Caused by: java.lang.RuntimeException: Broker failed to start
at com.sun.messaging.jmq.jmsclient.runtime.impl.BrokerInstanceImpl.start(BrokerInstanceImpl.java:205)
at com.sun.messaging.jms.blc.EmbeddedBrokerRunner.start(EmbeddedBrokerRunner.java:331)
at com.sun.messaging.jms.blc.LifecycleManagedBroker.start(LifecycleManagedBroker.java:457)
... 92 more
Caused by: java.io.IOException: [B3297]: Unable to make directory <mydirectory>/imq/instances/imqbroker/etc
at com.sun.messaging.jmq.jmsserver.Broker.initializePasswdFile(Broker.java:376)
I'm wondering where the directory that it is unable to make is configured.
I've been debugging this for days. I need to know where to configure the IP for the embedded broker. I also need to know where to set up the JMX RMI URL.
Any help would be appreciated. Thanks!
I found the solution to this problem. We had a broken symlink to the OpenMQ application directory within the Glassfish application directory. On domain startup, Glassfish could not find MQ and therefore could not start the embedded broker. Once we fixed the symlink, the embedded broker started up on Glassfish domain startup (asadmin start-domain).
I knew the embedded broker was not starting because the "imq" folder was not being created in <domaindir>/
Check for those broken symlinks!
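If you want to check for the same thing, one quick way to list broken symlinks under the Glassfish install is something like the following (the path is an example, and -xtype is a GNU find option):
# lists symlinks whose target no longer exists
find /path/to/glassfish -xtype l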

activemq failover using multiple instances in master slave mode on same linux machine

I have set up multiple ActiveMQ instances to achieve failover in master/slave mode on Windows.
While setting that up, I just created 3 instances under the bin folder without changing any ports and started all 3 instances one by one. The first instance became the master and the remaining ones stayed in slave mode until I stopped the master instance.
Now I am trying to achieve the same in a Linux environment. The first instance starts successfully, but when I start the second instance in a different window it throws the error below:
ERROR | Failed to start Apache ActiveMQ ([instance2, ID:132vm6-57227-1478597606120-0:1], java.io.IOException: Transport Connector could not be registered in JMX: java.io.IOException: Failed to bind to server socket: tcp://0.0.0.0:61616?maximumConnections=1000&wireFormat.maxFrameSize=104857600 due to: java.net.BindException: Address already in use)
INFO | Apache ActiveMQ 5.14.0 (instance2, ID:132vm6-57227-1478597606120-0:1) is shutting down
INFO | Connector openwire stopped
INFO | Connector amqp stopped
INFO | Connector stomp stopped
INFO | Connector mqtt stopped
INFO | Connector ws stopped
INFO | PListStore:[/opt/apache-activemq-5.14.0/bin/instance2/data/instance2/tmp_storage] stopped
INFO | Stopping async queue tasks
INFO | Stopping async topic tasks
INFO | Stopped KahaDB
INFO | Apache ActiveMQ 5.14.0 (instance2, ID:132vm6-57227-1478597606120-0:1) uptime 0.585 seconds
INFO | Apache ActiveMQ 5.14.0 (instance2, ID:132vm6-57227-1478597606120-0:1) is shutdown
INFO | Closing org.apache.activemq.xbean.XBeanBrokerFactory$1#4233871a: startup date [Tue Nov 08 15:03:24 IST 2016]; root of context hierarchy
WARN | Exception thrown from LifecycleProcessor on context close
java.lang.IllegalStateException: LifecycleProcessor not initialized - call 'refresh' before invoking lifecycle methods via the context: org.apache.activemq.xbean.XBeanBrokerFactory$1#4233871a: startup date [Tue Nov 08 15:03:24 IST 2016]; root of context hierarchy
at org.springframework.context.support.AbstractApplicationContext.getLifecycleProcessor(AbstractApplicationContext.java:357)[spring-context-4.1.9.RELEASE.jar:4.1.9.RELEASE]
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:884)[spring-context-4.1.9.RELEASE.jar:4.1.9.RELEASE]
at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:843)[spring-context-4.1.9.RELEASE.jar:4.1.9.RELEASE]
at org.apache.activemq.hooks.SpringContextHook.run(SpringContextHook.java:30)[activemq-spring-5.14.0.jar:5.14.0]
at org.apache.activemq.broker.BrokerService.stop(BrokerService.java:875)[activemq-broker-5.14.0.jar:5.14.0]
at org.apache.activemq.xbean.XBeanBrokerService.stop(XBeanBrokerService.java:122)[activemq-spring-5.14.0.jar:5.14.0]
at org.apache.activemq.broker.BrokerService.start(BrokerService.java:629)[activemq-broker-5.14.0.jar:5.14.0]
at org.apache.activemq.xbean.XBeanBrokerService.afterPropertiesSet(XBeanBrokerService.java:73)[activemq-spring-5.14.0.jar:5.14.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)[:1.7.0_65]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)[:1.7.0_65]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)[:1.7.0_65]
at java.lang.reflect.Method.invoke(Method.java:606)[:1.7.0_65]
I am using ActiveMQ version 5.14.0.
If anybody has encountered a similar issue, kindly provide your input.
To get multiple instances of ActiveMQ running on the same machine, you need to change the ports that they try to open. There are (at least) 3 ports that need to be changed:
The transportConnector ports that accept messaging traffic. These are defined in the activemq.xml file. Typically you only need the openwire one, which is 61616 by default; I usually change this in the other ActiveMQ instances to 61626, 61636, etc. You can usually comment out the others if you don't intend to use them.
The Jetty HTTP port. This is defined in the jetty.xml file. The default is 8161; set the next ones to 8162, 8163, etc.
The JMX port. This one's a bit tricky, as you need to stick a piece of config into the activemq.xml to explicitly define it as follows:
<managementContext>
<managementContext createConnector="true" connectorPort="1099"/>
</managementContext>
You can then change this to 1199, 1299 on the other instances. Hope this helps.
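As a sketch of what the per-instance changes can look like (ports taken from the example numbers above; the exact XML varies a bit between ActiveMQ releases), the second instance's activemq.xml gets its own openwire port, and its jetty.xml gets a bumped web console port:
<!-- activemq.xml (instance2): openwire on 61626 instead of 61616 -->
<transportConnector name="openwire" uri="tcp://0.0.0.0:61626?maximumConnections=1000&amp;wireFormat.maxFrameSize=104857600"/>
<!-- jetty.xml (instance2): web console on 8162 instead of 8161 -->
<bean id="jettyPort" class="org.apache.activemq.web.WebConsolePort" init-method="start">
    <property name="host" value="0.0.0.0"/>
    <property name="port" value="8162"/>
</bean>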

Liferay 6.2 clustering issue with multicast

I am trying to cluster Ehcache and Lucene with the Liferay 6.2 EE SP2 bundle on 2 servers with multicast enabled. We have Apache HTTPD servers fronting the Tomcat servers using a reverse proxy. A valid 6.2 license is deployed on both nodes.
We use the following properties in portal-ext.properties:
cluster.link.enabled=true
lucene.replicate.write=true
ehcache.cluster.link.replication.enabled=true
# Since we are using SSL on the frontend
web.server.protocol=https
# set this to any server that is visible to both the nodes
cluster.link.autodetect.address=dbserverip:dbport
#ports and ips we know work in our environment for multicast
multicast.group.address["cluster-link-control"]=ip
multicast.group.port["cluster-link-control"]=port1
multicast.group.address["cluster-link-udp"]=ip
multicast.group.port["cluster-link-udp"]=port2
multicast.group.address["cluster-link-mping"]=ip
multicast.group.port["cluster-link-mping"]=port3
multicast.group.address["hibernate"]=ip
multicast.group.port["hibernate"]=port4
multicast.group.address["multi-vm"]=ip
multicast.group.port["multi-vm"]=port5
We are running into issues with the Ehcache and Lucene clustering not working. The following tests fail:
Moving a portlet on node 1 does not show up on node 2.
There are no errors except for a startup error with Lucene.
14:19:35,771 ERROR [CLUSTER_EXECUTOR_CALLBACK_THREAD_POOL-1][LuceneHelperImpl:1186] Unable to load index for company 10157
com.liferay.portal.kernel.exception.SystemException: java.net.ConnectException: Connection refused
    at com.liferay.portal.search.lucene.LuceneHelperImpl.getLoadIndexesInputStreamFromCluster(LuceneHelperImpl.java:488)
    at com.liferay.portal.search.lucene.LuceneHelperImpl$LoadIndexClusterResponseCallback.callback(LuceneHelperImpl.java:1176)
    at com.liferay.portal.cluster.ClusterExecutorImpl$ClusterResponseCallbackJob.run(ClusterExecutorImpl.java:614)
    at com.liferay.portal.kernel.concurrent.ThreadPoolExecutor$WorkerTask._runTask(ThreadPoolExecutor.java:682)
    at com.liferay.portal.kernel.concurrent.ThreadPoolExecutor$WorkerTask.run(ThreadPoolExecutor.java:593)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:625)
    at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:160)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
    at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
We verified that the JGroups multicast works outside of Liferay by running the following commands, using a downloaded copy of jgroups.jar and substituting our 5 multicast IPs and ports.
Testing with JGROUPS
1) McastReceiver -
java -cp ./jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.10.10.10 -port 5555
ex. java -cp jgroups-final.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.10.10.10 -port 5555
2) McastSender -
java -cp ./jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.10.10.10 -port 5555
ex. java -cp jgroups-final.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.10.10.10 -port 5555
From there, typing things into the McastSender will result in the Receiver printing it out.
Thanks!
After a lot of troubleshooting and help from various folks on my team and at Liferay support, we switched to using unicast and it worked a lot better.
Here is what we did:
Extracted jgroups.jar from the Tomcat home/webapps/ROOT/WEB-INF/lib and saved it locally.
Unzipped the jgroups.jar file, then extracted and saved the tcp.xml from the jar.
As a baseline test, changed the TCPPING section in the tcp.xml and saved it:
<TCPPING timeout="3000"
    initial_hosts="${jgroups.tcpping.initial_hosts:servername1[7800],servername2[7800]}"
    port_range="1"
    num_initial_members="10"/>
Copied the tcp.xml to the Liferay home on both nodes.
Changed portal-ext.properties to remove the multicast properties and added the following lines:
cluster.link.channel.properties.control=${liferay.home}/tcp.xml
cluster.link.channel.properties.transport.0=${liferay.home}/tcp.xml
Started node 1.
Started node 2.
Checked the logs.
Did the cluster cache tests:
Moving a portlet on node 1 shows up on node 2.
Under Control Panel -> License Manager, both nodes show up with valid licenses.
Searching on node 2 for a user added on node 1 in Control Panel -> Users and Organizations.
All of the above tests worked.
So we shut down the servers and changed the tcp.xml to use JDBC rather than TCPPING, so we don't have to specify node names manually.
Steps for the JDBC config:
Created the table in the Liferay database manually:
CREATE TABLE JGROUPSPING (own_addr varchar(200) not null, cluster_name varchar(200) not null, ping_data blob default null, primary key (own_addr, cluster_name))
Changed the tcp.xml to remove the TCPPING section and added the following:
<JDBC_PING datasource_jndi_name="java:comp/env/jdbc/LiferayPool"
    initialize_sql="" />
Saved and pushed the file manually to both nodes.
Started the servers and repeated the tests above.
It should work seamlessly.
It was invaluable to have debug logging turned on for JGroups, as mentioned in the following post:
https://bitsofinfo.wordpress.com/2014/05/21/clustering-liferay-globally-across-data-centers-gslb-with-jgroups-and-relay2/
Here is the Tomcat home/webapps/ROOT/WEB-INF/classes/META-INF/portal-log4j-ext.xml file I used to triage various clustering issues on bootup:
<?xml version="1.0"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<category name="com.liferay.portal.cluster">
<priority value="TRACE" />
</category>
<category name="com.liferay.portal.license">
<priority value="TRACE" />
</category>
</log4j:configuration>
We also found that the Lucene cluster replication startup errors were fixed in a fix pack, and we are getting a patch for it.
https://issues.liferay.com/browse/LPS-51714
https://issues.liferay.com/browse/LPS-51428
We added the following portal instance properties for Lucene replication to work better between the 2 nodes:
portal.instance.http.port=port that the app servers listen on ex. 8080
portal.instance.protocol=http
Hope this helps someone.
Update
The Lucene index load in a cluster issue was resolved by a Liferay 6.2 EE patch from support for the LPS issues mentioned above.

How can I change which address Datastax agent will try to connect to?

My Cassandra instances are not listening on 127.0.0.1. When I start datastax-agent, I find this in the logs:
# tail -n 100 /var/log/datastax-agent/agent.log
...
ERROR [Initialization] 2015-05-19 22:35:04,064 Can't connect to Cassandra, retrying soon.
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.TransportException: [/127.0.0.1:9042] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:220)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:78)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1231)
at com.datastax.driver.core.Cluster.init(Cluster.java:158)
at com.datastax.driver.core.Cluster.connect(Cluster.java:246)
at clojurewerkz.cassaforte.client$connect_or_close.doInvoke(client.clj:149)
at clojure.lang.RestFn.invoke(RestFn.java:410)
at clojurewerkz.cassaforte.client$connect.invoke(client.clj:165)
at opsagent.cassandra$setup_cassandra$fn__8157.invoke(cassandra.clj:344)
at again.core$with_retries_STAR_$fn__8013.invoke(core.clj:98)
at again.core$with_retries_STAR_.invoke(core.clj:97)
at opsagent.cassandra$setup_cassandra.invoke(cassandra.clj:339)
at opsagent.opsagent$setup_cassandra.invoke(opsagent.clj:153)
at opsagent.jmx$determine_ip.invoke(jmx.clj:276)
at opsagent.jmx$setup_jmx$fn__8438.invoke(jmx.clj:293)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:745)
How can I change which address the DataStax Agent connects to? I have tried setting local_interface in the agent's address.yaml (and restarting the agent), but that doesn't seem to work.
The secret was to set rpc_address to 0.0.0.0. Credit to LHWizard for pointing this out.
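For context, a minimal sketch of the relevant cassandra.yaml lines (the broadcast address below is an example value; use the node's real IP, and restart Cassandra plus the agent afterwards):
# cassandra.yaml
rpc_address: 0.0.0.0
# when rpc_address is 0.0.0.0, a concrete broadcast_rpc_address is required
broadcast_rpc_address: 192.168.1.10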