Erlang: monitor multiple processes

I need to monitor a bunch of worker processes. Currently I'm able to monitor 1 process through 1 monitor. How do I scale this to monitoring N worker processes? Do I need to spawn N monitors as well? If so, then what happens if one of those spawned monitors fails/crashes?

Do I need to spawn N monitors as well?
No:
-module(mo).
-compile(export_all).

worker(Id) ->
    timer:sleep(1000 * rand:uniform(5)),
    io:format("Worker~w: I'm still alive~n", [Id]),
    worker(Id).

create_workers(N) ->
    Workers = [ % { {Pid, Ref}, Id }
        { spawn_monitor(?MODULE, worker, [Id]), Id }
        || Id <- lists:seq(1, N)
    ],
    monitor_workers(Workers).

monitor_workers(Workers) ->
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            Worker = {Pid, Ref},
            case is_my_worker(Worker, Workers) of
                true ->
                    NewWorkers = replace_worker(Worker, Workers, Why),
                    io:format("Old Workers:~n~p~n", [Workers]),
                    io:format("New Workers:~n~p~n", [NewWorkers]),
                    monitor_workers(NewWorkers);
                false ->
                    monitor_workers(Workers)
            end;
        _Other ->
            monitor_workers(Workers)
    end.

is_my_worker(Worker, Workers) ->
    lists:keymember(Worker, 1, Workers).

replace_worker(Worker, Workers, Why) ->
    {{Pid, _}, Id} = lists:keyfind(Worker, 1, Workers),
    io:format("Worker~w (~w) went down: ~s~n", [Id, Pid, Why]),
    NewWorkers = lists:keydelete(Worker, 1, Workers),
    NewWorker = spawn_monitor(?MODULE, worker, [Id]),
    [{NewWorker, Id} | NewWorkers].

start() ->
    observer:start(), %% In the Processes tab, you can right-click a worker and kill it.
    create_workers(4).
In the shell:
$ ./run
Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V8.2 (abort with ^G)
1> Worker3: I'm still alive
Worker1: I'm still alive
Worker2: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker2: I'm still alive
Worker3: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker3 (<0.87.0>) went down: killed
Old Workers:
[{{<0.85.0>,#Ref<0.0.4.292>},1},
{{<0.86.0>,#Ref<0.0.4.293>},2},
{{<0.87.0>,#Ref<0.0.4.294>},3},
{{<0.88.0>,#Ref<0.0.4.295>},4}]
New Workers:
[{{<0.2386.0>,#Ref<0.0.1.416>},3},
{{<0.85.0>,#Ref<0.0.4.292>},1},
{{<0.86.0>,#Ref<0.0.4.293>},2},
{{<0.88.0>,#Ref<0.0.4.295>},4}]
Worker2: I'm still alive
Worker1: I'm still alive
Worker2: I'm still alive
Worker1: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
Worker2: I'm still alive
Worker1: I'm still alive
Worker3: I'm still alive
Worker4: I'm still alive
Worker1: I'm still alive
Worker4 (<0.88.0>) went down: killed
Old Workers:
[{{<0.2386.0>,#Ref<0.0.1.416>},3},
{{<0.85.0>,#Ref<0.0.4.292>},1},
{{<0.86.0>,#Ref<0.0.4.293>},2},
{{<0.88.0>,#Ref<0.0.4.295>},4}]
New Workers:
[{{<0.5322.0>,#Ref<0.0.1.9248>},4},
{{<0.2386.0>,#Ref<0.0.1.416>},3},
{{<0.85.0>,#Ref<0.0.4.292>},1},
{{<0.86.0>,#Ref<0.0.4.293>},2}]
Worker3: I'm still alive
Worker2: I'm still alive
Worker4: I'm still alive
Worker1: I'm still alive
Worker3: I'm still alive
Worker3: I'm still alive
Worker2: I'm still alive
Worker1 (<0.85.0>) went down: killed
Old Workers:
[{{<0.5322.0>,#Ref<0.0.1.9248>},4},
{{<0.2386.0>,#Ref<0.0.1.416>},3},
{{<0.85.0>,#Ref<0.0.4.292>},1},
{{<0.86.0>,#Ref<0.0.4.293>},2}]
New Workers:
[{{<0.5710.0>,#Ref<0.0.1.10430>},1},
{{<0.5322.0>,#Ref<0.0.1.9248>},4},
{{<0.2386.0>,#Ref<0.0.1.416>},3},
{{<0.86.0>,#Ref<0.0.4.293>},2}]
Worker2: I'm still alive
Worker3: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
I think the version below is probably more efficient: it uses lists:map() to both search for and replace the crashed worker, so it only traverses the worker list once:
-module(mo).
-compile(export_all).

worker(Id) ->
    timer:sleep(1000 * rand:uniform(5)),
    io:format("Worker~w: I'm still alive~n", [Id]),
    worker(Id).

create_workers(N) ->
    Workers = [ % { {Pid, Ref}, Id }
        { spawn_monitor(?MODULE, worker, [Id]), Id }
        || Id <- lists:seq(1, N)
    ],
    monitor_workers(Workers).

monitor_workers(Workers) ->
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            CrashedWorker = {Pid, Ref},
            NewWorkers = replace(CrashedWorker, Workers, Why),
            io:format("Old Workers:~n~p~n", [Workers]),
            io:format("New Workers:~n~p~n", [NewWorkers]),
            monitor_workers(NewWorkers);
        _Other ->
            monitor_workers(Workers)
    end.

replace(CrashedWorker, Workers, Why) ->
    lists:map(fun(PidRefId) ->
                  { {Pid, _Ref} = Worker, Id} = PidRefId,
                  case Worker =:= CrashedWorker of
                      true ->  % replace worker
                          io:format("Worker~w (~w) went down: ~s~n",
                                    [Id, Pid, Why]),
                          {spawn_monitor(?MODULE, worker, [Id]), Id}; %=> { {Pid,Ref}, Id }
                      false -> % leave worker alone
                          PidRefId
                  end
              end,
              Workers).

start() ->
    observer:start(), %% In the Processes tab, you can right-click a worker and kill it.
    create_workers(4).
If so, then what happens if one of those spawned monitors fails/crashes?
Erlang owns several server farms in different countries, and Erlang has acquired several redundant power grids, so Erlang will restart everything in a fault-tolerant, distributed system that will never fail. It's all built in. You don't have to worry about anything. :)
Actually... anywhere you can imagine something failing, it has to be backed up, e.g. by another monitoring process on another computer.

Do not spawn a process and then monitor it separately; that used to cause issues in production (spawning and monitoring in two steps leaves a window in which the process can exit before the monitor is attached). Use spawn_monitor instead.
You can start and monitor multiple processes from your supervising process. If you check the documentation for monitor, you will notice that every time a monitored process dies, a message like
{'DOWN', MonitorRef, Type, Object, Info}
is sent to the process that is monitoring the one that just died.
You can then decide what to do: MonitorRef is the reference you got when you started monitoring the process, and Object holds the Pid of the process that died, or its registered name if you assigned it one.
It is a nice exercise to write some sample code using monitor, but for real systems try to stick to the OTP library and OTP supervisors instead.
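For comparison, here is a minimal sketch of that OTP approach, reusing the worker/1 loop from the modules above. The module name mo_sup and the mo:start_worker/1 helper are illustrative, not part of the original code:
%% Sketch only: a one_for_one supervisor that restarts each worker when it crashes.
-module(mo_sup).
-behaviour(supervisor).
-export([start_link/1, init/1]).

start_link(N) ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, N).

init(N) ->
    ChildSpecs = [#{id => {worker, Id},
                    start => {mo, start_worker, [Id]}, %% assumed helper, see below
                    restart => permanent,
                    type => worker}
                  || Id <- lists:seq(1, N)],
    {ok, {#{strategy => one_for_one, intensity => 5, period => 10}, ChildSpecs}}.

%% The assumed helper in mo.erl links instead of monitoring:
%% start_worker(Id) -> {ok, spawn_link(?MODULE, worker, [Id])}.
Killing any worker now makes the supervisor restart it, and the supervisor can itself sit under another supervisor higher up the tree, which is OTP's usual answer to "what if the monitor itself crashes?".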

Related

What does "bw: SpinningDown" mean in a RedisTimeoutException?

What does "bw: SpinningDown" mean in this error -
Timeout performing GET (5000ms), next: GET foo!bar!baz, inst: 5, qu: 0, qs: 0, aw: False, bw: SpinningDown, ....
Does it mean that the Redis server instance is spinning down, or something else?
It actually means something else. The abbreviation bw stands for Backlog-Writer, and it reports the status of what the backlog writer is doing.
For this particular status, SpinningDown, you actually left out the important bits that relate to it.
There are four values being tracked for the workers: Busy, Free, Min and Max.
Let's take these hypothetical values: Busy=250,Free=750,Min=200,Max=1000
In this case there are 50 more existing (busy) threads than the minimum.
The cost of spinning up a new thread is high, especially if you hit the .NET-provided global thread pool limit, in which case only one new thread is created every 500 ms due to throttling.
So once the Backlog is done processing an item, instead of just exiting the thread, it keeps it in a waiting state (SpinningDown) for 5 seconds. If during that time there is still more Backlog to process, the same thread will process another item from the Backlog.
If no Backlog item needed to be processed in those 5 seconds, the thread exits, which eventually leads to a decrease in Busy (existing) threads.
This only happens for threads above the Min count of course, as those will be kept alive even if there is no work to do.

Can't get Messages from ActiveMQ Queue

I'm running ActiveMQ 5.14.5. I have a Queue with some Pending messages. Screenshot from the console:
There are no active consumers.
The console reports that there are 21651 messages. However if I try and view them, it appears to be empty:
Furthermore, when I try to call receive() on my org.apache.activemq.jms.pool.PooledConnection, it blocks and receives no messages.
I'm fairly sure that there are messages there, and they should be retrieved. This used to work, and has stopped working.
Is there an explanation for this? There aren't any errors in the log.
Edit:
I'm using the Java client in Clojure. I didn't want to share it because it might confuse matters, but here it is. I'm using a Pooled factory in a couple of different threads. But I think the above example using the console is self-contained.
(let [factory (org.apache.activemq.ActiveMQConnectionFactory.
                "Username"
                "Password"
                "URI")
      pooled-connection-factory (org.apache.activemq.jms.pool.PooledConnectionFactory.)]
  (.setConnectionFactory pooled-connection-factory factory)
  (.start pooled-connection-factory)
  (with-open [connection (.createConnection factory)]
    (let [session     (.createSession connection false javax.jms.Session/AUTO_ACKNOWLEDGE)
          destination (.createQueue session (:queue-name config))
          consumer    (.createConsumer session destination)]
      (.start connection)
      (loop [message (.receive consumer)]
        (println (.getText ^org.apache.activemq.command.ActiveMQTextMessage message))
        (recur (.receive consumer))))))

kombu not reconnecting to RabbitMQ

I have two servers, call them A and B. B runs RabbitMQ, while A connects to RabbitMQ via Kombu. If I restart RabbitMQ on B, the kombu connection breaks, and the messages are no longer delivered. I then have to reset the process on A to re-establish the connection. Is there a better approach, i.e. is there a way for Kombu to re-connect automatically, even if the RabbitMQ process is restarted?
My basic code implementation is below, thanks in advance! :)
def start_consumer(routing_key, incoming_exchange_name, outgoing_exchange_name):
    global rabbitmq_producer
    incoming_exchange = kombu.Exchange(name=incoming_exchange_name, type='direct')
    incoming_queue = kombu.Queue(name=routing_key+'_'+incoming_exchange_name,
                                 exchange=incoming_exchange,
                                 routing_key=routing_key)  # , auto_delete=True)
    outgoing_exchange = kombu.Exchange(name=outgoing_exchange_name, type='direct')
    rabbitmq_producer = kombu.Producer(settings.rabbitmq_connection0,
                                       exchange=outgoing_exchange,
                                       serializer='json',
                                       compression=None,
                                       auto_declare=True)
    settings.rabbitmq_connection0.connect()
    if settings.rabbitmq_connection0.connected:
        callbacks = []
        queues = []
        callbacks.append(callback)
        # if push_queue:
        #     callbacks.append(push_message_callback)
        queues.append(incoming_queue)
        print 'opening a new *incoming* rabbitmq connection to the %s exchange for the %s queue' % (incoming_exchange.name, incoming_queue.name)
        incoming_exchange(settings.rabbitmq_connection0).declare()
        incoming_queue(settings.rabbitmq_connection0).declare()
        print 'opening a new *outgoing* rabbitmq connection to the %s exchange' % outgoing_exchange.name
        outgoing_exchange(settings.rabbitmq_connection0).declare()
        with settings.rabbitmq_connection0.Consumer(queues=queues, callbacks=callbacks) as consumer:
            while True:
                settings.rabbitmq_connection0.drain_events()
On the consumer side, kombu.mixins.ConsumerMixin handles reconnecting when the connection goes away (and also does heartbeats, etc., and lets you write less code). There doesn't seem to be a ProducerMixin, unfortunately, but you could potentially dig into the code and adapt it...?
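A minimal sketch of the ConsumerMixin approach is below; the broker URL, exchange/queue names and the message handler are placeholders rather than values from the question's code:
from kombu import Connection, Exchange, Queue
from kombu.mixins import ConsumerMixin

incoming_exchange = Exchange('incoming_exchange', type='direct')
incoming_queue = Queue('my_routing_key_incoming', exchange=incoming_exchange,
                       routing_key='my_routing_key')

class Worker(ConsumerMixin):
    def __init__(self, connection):
        self.connection = connection

    def get_consumers(self, Consumer, channel):
        return [Consumer(queues=[incoming_queue],
                         callbacks=[self.on_message],
                         accept=['json'])]

    def on_message(self, body, message):
        print('got message: %r' % (body,))
        message.ack()

if __name__ == '__main__':
    with Connection('amqp://guest:guest@localhost:5672//') as conn:
        # run() loops on drain_events() and re-establishes the connection
        # whenever the broker goes away and comes back.
        Worker(conn).run()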

Globally registered process is not registered

I'm using
spawn (node, module, function, Args)
global:register_name(name, pid)
To register a process on a different node globally.
Here's the code
Pid = spawn(mi, loop, [X]),
io:format("Glavni PID: ~w~n", [Pid]),
register(glavni, Pid),
Pid1 = spawn(prvi@Molly, mi, loop_prvi, []),
io:format("Prvi PID: ~w~n", [Pid1]),
global:register_name(prvi, Pid1),
When I run the code, it doesn't throw any error, but when I call whereis/1 with the registered name I get undefined on the node that spawned it.
And when I call whereis/1 from any node, either the master node or the node I created the process on, it says:
(prvi@Molly)2> whereis(prvi).
undefined
(prvi@Molly)3> whereis(prvi@Molly).
undefined
To register a name across several nodes you have to:
- start several nodes
- ensure they use the same cookie
- connect them (for example, on node A execute net_adm:ping(B))
- start a process on node B with spawn(Node, ...)
- register it with global:register_name(Name, Pid)
- check the registration with global:whereis_name(Name)
You are missing at least the last step, but all of them are necessary.
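For illustration, a shell session following those steps might look roughly like this (the node names and mi:loop_prvi/0 come from the question; the cookie setup is assumed). Note in particular the last step: globally registered names are looked up with global:whereis_name/1, not whereis/1, which only sees locally registered names:
%% Two nodes on host Molly, started e.g. as `erl -sname glavni` and
%% `erl -sname prvi` with the same cookie.
(glavni@Molly)1> net_adm:ping('prvi@Molly').   % connect the two nodes
pong
(glavni@Molly)2> Pid1 = spawn('prvi@Molly', mi, loop_prvi, []).
<9007.85.0>
(glavni@Molly)3> global:register_name(prvi, Pid1).
yes
(glavni@Molly)4> global:whereis_name(prvi).    % visible from any connected node
<9007.85.0>
(glavni@Molly)5> whereis(prvi).                % local registry only, hence undefined
undefined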

In celery, how to ensure tasks are retried when worker crashes

First of all, please don't consider this question a duplicate of this question.
I have set up an environment which uses Celery with Redis as the broker and result_backend. My question is: how can I make sure that when the Celery workers crash, all the scheduled tasks are retried once the Celery worker is back up?
I have seen advice on using CELERY_ACKS_LATE = True, so that the broker will re-drive the tasks until it gets an ACK, but in my case it's not working. Whenever I schedule a task, it immediately goes to the worker, which holds it until the scheduled time of execution. Let me give an example:
I am scheduling a task like this: res=test_task.apply_async(countdown=600), but immediately in the Celery worker logs I can see something like: Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...]. Now when I kill the Celery worker, these scheduled tasks are lost. My settings:
BROKER_URL = "redis://localhost:6379/0"
CELERY_ALWAYS_EAGER = False
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
Apparently this is how celery behaves.
When the worker is abruptly killed (but the dispatching process isn't), the message will be considered 'failed' even though you have acks_late=True.
The motivation (to my understanding) is that if the consumer was killed by the OS due to an out-of-memory condition, there is no point in redelivering the same task.
You may see the exact issue here: https://github.com/celery/celery/issues/1628
I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
I've had this issue where I was using some open-source C libraries that went totally amok and crashed my worker ungracefully without throwing an exception. To guard against failures of any kind, one can simply wrap the content of a task in a child process and check its status in the parent.
# This runs inside a bound Celery task (hence self); assumes `import os` at module level.
n = os.fork()
if n > 0:  # inside the parent process
    status = os.wait()  # wait until the child terminates
    print("Signal number that killed the child process:", status[1])
    if status[1] > 0:  # the signal was something other than a graceful exit
        # here one can do whatever they want, like retry or throw an exception
        self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
else:  # here comes the actual task content with its respective return
    return myResult  # make sure the child and the parent don't both return