ContainerRequestState [INFO] No more pending requests in queue - hadoop-yarn

I am using a MapR (YARN) cluster with 3 nodes. I am trying to deploy 6 Samza jobs on the cluster for some processing on data streams. All jobs are correct. I tried deploying 2-3 in parallel and they work.
However when I deploy all the 6 Samza jobs in parallel I see following logs. The tasks continue to run and dont produce expected output data stream.
The status of the nodes on my ResourceManager web dashboard is as follows-
Can anyone suggest how can this be resolved. I think that maybe the application does not have sufficient resources to run all of them in parallel. What change can I try?

no more pending requests in queue.
This message means that still more messages in your Kafka Topic.

Related

One mule app server in cluster polling maximum message from MQ

My mule application is comprised of 2 nodes running in a cluster, and it listens to IBM MQ Cluster (basically connecting to 2 MQ via queue manager). There are situations where one mule node pulls or takes more than 80% of message from MQ cluster and another mule node picks rest 20%. This is causing CPU performance issues.
We have double checked that all load balancing is proper, and very few times we get CPU performance problem. Please can anybody give some ideas what could be possible reason for it.
Example: last scenario was created where there are 200000 messages in queue, and node2 mule server picked 92% of message from queue within few minutes.
This issue has been fixed now. Got into the root cause - our mule application running on MULE_NODE01 reads/writes to WMQ_NODE01, and similarly for node 2. One of the mule node (lets say MULE_NODE02) reads from linux/windows file system and puts huge messages to its corresponding WMQ_NODE02. Now, its IBM MQ which tries to push maximum load to other WMQ node to balance the work load. That's why MULE_NODE01 reads all those loaded files from WMQ_NODE01 and causes CPU usage alerts.
#JoshMc your clue helped a lot in understanding the issues, thanks a lot for helping.
Its WMQ node in a cluster which tries to push maximum load to other WMQ node, seems like this is how MQ works internally.
To solve this, we are now connecting our mule node to MQ gateway, rather making 1-to-1 connectivity
This could be solved by avoiding the racing condition caused by multiple listeners. Configure the listener in the cluster to the primary node only.
republish the message to a persistent VM queue.
move the logic to another flow that could be triggered via a VM listener and let the Mule cluster do the load balancing.

Distributed job management system

I'm using beeQueue for video transcoding job scheduling and processing
For now everything is fine and but I'm now facing challenge of working with distributed environment like auto scaling the amazon the instances for adding more workers to process more jobs which are pending in the queue, We scale well but need to implement a system which is fail safe, I mean in case a instance on which workers were processing the job has gone shutdown and we don't get job status or events, In that case the job which were running on that instance is gone into blackhole and can't be recovered and processed again.
What I did :
I'm looking up for ready made solution who works fail safe in distributed env.
Thanks

How to properly restart a kafka s3 sink connect?

I started a kafka s3 sink connector (bundle connector from confluent package) since 1 May. It works fine until 8 May. Checking the status, it tells that some aws exception crashes this connector. This should not be a big problem, so I want to restore it.
I tried the following steps:
I POST /connectors/s3sink/restart . Then I saw the connector is in RUNNING mode, but the task is still FAIL.
Then I PUT /connectors/s3sink/task/0/restart. Ok, now the task is in RUNNING mode.
But then I tail the log, I found it starts to rewrite the old data, such as 3 May data. And it messed the old data!
So, does connect restart REST API reset the offset? I thought it will save the offset and just start from the offset it fails.
And how to restart a failed connector task correctly? By deleting those PODs? (using kubernetes), or by REST /task/0/restart? When should I use /connectors/s3sink/restart?
/connector/:name/restart is a rolling restart operation on the worker leader that needs to propagate to all worker server tasks in async fashion. So, you need to ensure network connection between the leader worker and all others.
/connector/:name/task/:num/restart will send request straight to that worker, restarting the thread.
Restart should not reset the offset since they are stored in the consumer offsets topic for that connect cluster. If anything, the tasks were not able to commit offsets back to the __consumer_offsets topic, but you should see logs for that.

Amazon EMR managing my spark cluster

I have a spark setup on Amazon EC2 machines with 2 worker machines running. It reads data from cassandra, do some processing and write to sql server. I have heard about amazon EMR and read about it. I want a managed system where my worker machines are automatically added to my cluster if my job is taking more time and shutdown when my job gets completed.
Can I achieve this through Amazon EMR?
The requirements are:
My worker machines are automatically added to my cluster if my job is taking more time.
Shutdown when my job gets completed.
No. 2 is definitely possible if your job is launched from the steps. There is an option that auto-terminates cluster after the last step is completed. Alternatively, this could also be done programatically with the SDK.
No. 1 is a little more difficult but EMR has three classes of nodes; master, core, and task. Task nodes can be added after cluster creation. The trigger for that would probably have to be done programatically or utilizing another Amazon service, like Lambda.

Celery + RabbitMQ, some AMQP temporay queues not expiring

I have a celery + rabbitmq setup with a busy django site, in the celery setting I have this config:
CELERY_RESULT_BACKEND = "amqp"
CELERY_AMQP_TASK_RESULT_EXPIRES = 5
I am monitoring the queues with the "watch" command, what I have observed is whilst most of the temporary queues get deleled after a few seconds, there are some queues (same guid) did not get deleted, and the list grows slowly, regardless how many workers used.
The django site does generate about 60 tasks per second, accepts various information and use the tasks to digest information. The whole setup runs on a 16 core cpu server with plenty of RAM. Would this still caused by performance issue or a celery bug?
Cheers
James