Airflow with Celery not able to connect - rabbitmq

Every time I start airflow worker I keep getting this error
[2017-11-07 16:24:12,354: ERROR/MainProcess] consumer: Cannot connect to amqp://myuser:**#127.0.0.1:8793/myvhost: timed out.
Trying again in 26.00 seconds...
I have followed the instructions to install CeleryExecutors on Airflow as well as installing RabbitMQ using this documentation.
I have went and configured my airflow.cfg to reflect this by changing the celery_result_backend and broker_url to point to the right address (amqp://myuser:mypassword#localhost:8793/myvhost for example, from the documentation). I had this up and running at some point and when I changed the DAG directory. Changing the DAG directory shouldn't have an effect on it besides changing what's inside the DagBag.
Is there anything else I'm supposed to look at to debug and get the Celery Workers up and running?

Following the posts here and here, it was because I was running Celery4.1.0. I downgraded to Celery3.1.7 and it is now working.

Related

Web HUE is not getting loaded though HUE is workin on the port 8000

I have installed the Hue on the Linux whixh is an instance from Azure. I have made all the required changes in ambari and hue.ini conf file. And when I run the supervisor job, it runs fine
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
hue 83933 sshuser 3u IPv4 15707246 0t0 TCP *:8000 (LISTEN)
But when I try to access the wb hue, I don't get any page loaded. It shows refused to connect.
Tried deleting caches and reset up was done.
I am using hue 4.7 version and I don't find any issues in error.log file. Yet, I don't see any data in access.log file. Could you please help me?
Do you have
http_host=0.0.0.0
in the hue.ini?
#Ruthikajawar here is a working hue.ini for ambari
https://github.com/steven-dfheinz/HDP3-Hue-Service/blob/Hue.4.6.0/configuration/live.hue.ini
I have noticed that sometimes, after initial install, it takes 1 or 2 restarts to get the WEBUI to work. I have also noticed sometimes, after a restart, it takes quite a few moments before the WEBUI starts to respond.
Give it some time after restart and check the WEBUI. If you still are not getting it to answer you need to check /var/log/hue/error.log as it should be very specific with errors causing the WEBUI to fail on startup.

RabbitMQ SSL Clustering on Windows Breaks rabbitmqctl

I'm trying to follow the steps in the RabbitMQ docs here to get clustering with SSL working on Windows. I'm noticing though that the "rabbitmqctl status" command starts failing after the environment variables defined in those steps are set. I'm getting the following error when executing "rabbitmqctl status":
Error: unable to connect to node 'rabbit#server1': nodedown
I've already configured RabbitMQ to use TLS 1.2 and have verified that it's working. I've ensured that my Erlang 18 cookie is the same in the user directory C:\users\me and C:\Windows on the machine, but the error persists, and is stopping other servers from clustering with it. The docs say that the Windows SSL Cluster setup is "Coming soon"... Here are the steps I've taken so far on server1. I think that Erlang wants forward slashes in the paths - this matches the rabbit.config SSL settings.
Combined the contents of my server\cert.pem and server\key.pem into rabbit.pem via the command "type server\cert.pem server\key.pem > server\rabbit.pem"
Created environment variable ERL_SSL_PATH and set to: "C:/Program
Files/erl7.0/lib/ssl-7.0/ebin"
Created environment variable RABBITMQ_CTL_ERL_ARGS and set to: -pa "%ERL_SSL_PATH%" -proto_dist inet_tls -ssl_dist_opt server_certfile C:/OpenSSL-Win64/server/rabbit.pem -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true
Created environment variable RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS and set to same value as RABBITMQ_CTL_ERL_ARGS
Copied the erlang cookie at C:\Windows.erlang.cookie to my local user profile directory.
Restarted rabbit using rabbitmq-service start
At this point, on server1, "rabbitmqctl status" no longer works. Attempts to try to join server2 to server1 result in a "node down" error.
Edit 1: I can't get the initial step in the docs working to ask Erlang to report its SSL directory on Windows in order to set ERL_SSL_PATH correctly. Erlang is installed at C:\Program Files\erl7.0 on my server.
Edit 2: Using werl.exe (at C:\Program Files\erl7.0\bin\werl.exe), I was able to issue a command "Foo=io:format(code:lib_dir(ssl, ebin))." and it reported the path as: c:/Program Files/erl7.0/lib/ssl-7.0/ebin. However, this doesn't seem to be the cause of the this issue since that's already what I was using.
Thanks,
Andy
For environment changes to take effect on Windows, the service must be
re-installed. It is not sufficient to restart the service. This can be
done using the installer or on the command line with administrator
permissions
(source)
This will do:
rabbitmq-service.bat stop
rabbitmq-service.bat remove
rabbitmq-service.bat install
rabbitmq-service.bat start
Also, if while the node you're working on is down, the other cluster nodes were running, their state might be assumed to have gone out of sync. In that case, the node might fail to start up and you might need to:
rabbitmqctl force_boot
Check the logs to confirm. (at %RABBIT_BASE%\log\rabbit#server.log)
Late answer but, hopefully this could help a searcher...

elasticsearch-mesos not getting listed under frameworks of mesosUI

Iam trying to run elasticsearch-mesos on mesos.My machine is running ubuntu 14.04. I have running mesos cluster installed with mesosphere packages by following these instructions. When I run test frameworks it gets lister under frameworks of mesosUI but for elasticsearch-mesos its not getting listed under mesos webUI. I want to run elasticsearch-mesos on top of mesos. I followed instructions given here. When I run ./elasticsearch-mesos I am getting a message in terminal
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I tried running ./elasticsearch-mesos on both mesos masters and slaves.
The last few lines of terminal output is given below
2015-01-08 17:24:01,881:23844(0x7f175bfff700):ZOO_INFO#zookeeper_init#786: Initiating
client connection, host=localhost:2181 sessionTimeout=10000 watcher=0x7f1762a3e6a0
sessionId=0 sessionPasswd=<null> context=0x7f1710002530 flags=0
I0108 17:24:01.881392 23858 sched.cpp:137] Version: 0.21.1
2015-01-08 17:24:01,881:23844(0x7f172b7fe700):ZOO_INFO#check_events#1703: initiated
connection to server [127.0.0.1:2181]
2015-01-08 17:24:01,897:23844(0x7f172b7fe700):ZOO_INFO#check_events#1750: session
establishment complete on server [127.0.0.1:2181], sessionId=0x14ac7c469270006,
negotiated timeout=10000
I0108 17:24:01.898455 23861 group.cpp:313] Group process (group(1)#127.0.1.1:38668)
connected to ZooKeeper
I0108 17:24:01.898509 23861 group.cpp:790] Syncing group operations: queue size (joins,
cancels, datas) = (0, 0, 0)
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
According to the README at https://github.com/mesosphere/elasticsearch-mesos,
you may need to modify mesos.master.url to point to the same ZK url that the Mesos master is using (maybe not localhost). If you're using a single-master Mesos cluster, you can skip the ZK url and point this parameter directly to the Mesos master.
Please also note that the elasticsearch framework is a bit outdated, so use with caution

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we run into a disturbing solr issue.
After a restart of the whole cluster one of the shard stop being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looks fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in Cluster mode (5 shards) on jetty.
Each shard run on a different host with one zookeeper server.
I checked the zookeeper log and I cannot see anything there.
The only difference is that in the /overseer_election/election folder I see this specific server repeated 3 times, while the other server are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other check can we do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds
You will need to restart the Solr hosts as well to resolve the problem.
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly, it thought the service was dead, but it wasn't, and at some point 2 services were started. Because the 2nd service will not be able to start (as they both can't listen on the same port) it seems to just sit there in a failed state hanging, or fails to start the process but just messes up the other solr processes somehow by possibly overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that zookeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, killall instances of solr (or java if you know it's safe), and restart the solr service.
We figured out!
The issue was that jetty didn't really stop so we had 2 running processes, for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.

Brisk TaskTracker not starting in a multi-node Brisk setup

I have a 3 node Brisk cluster (Briskv1.0_beta2). Cassandra is working fine (all three nodes see each other and data is balanced across the ring). I started the nodes with the brisk cassandra -t command. I cannot, however, run any Hive or Pig jobs. When I do, I get an exception saying that it cannot connect to the task tracker.
During the startup process, I see the following in the log:
TaskTracker.java (line 695) TaskTracker up at: localhost.localdomain/127.0.0.1:34928
A few lines later, however, I see this:
Retrying connect to server: localhost.localdomain/127.0.0.1:8012. Already tried 9 time(s).
INFO [TASK-TRACKER-INIT] RPC.java (line 321) Server at localhost.localdomain/127.0.0.1:8012 not available yet, Zzzzz...
Those lines are repeated non-stop as long as my cluster is running.
My cassandra.yaml file specifies the box IP (not 0.0.0.0 or localhost) as the listen_address and the rpc_address is set to 0.0.0.0
Why is the client attempting to connect to a different port than the log shows the task tracker as using? Is there anywhere these addresses/ports can be specified?
I figured this out. In case anyone else has the same issues, here's what was going on:
Brisk uses the first entry in the Cassandra cluster's seed list to pick the initial jobtracker. One of my nodes had 127.0.0.1 in the seed list. This worked for the Cassandra setup since all the other nodes in the cluster connected to that box to get the cluster topology but this didn't work for the job tracker selection.
looks like your jobtracker isn't running. What do you see when you run "brisktool jobtracker"?