AnyLogic - Internal Error(s): Engine still has x events scheduled: xyz: [null] - error-handling

After stopping my simulation I occasionally get the following error message:
Example:
Exception during stopping the engine:
INTERNAL ERROR(S):
Engine still has 6 events scheduled: 2386.0: [null]
java.lang.RuntimeException: INTERNAL ERROR(S):
Engine still has 6 events scheduled: 2386.0: [null]
at com.anylogic.engine.Engine.g(Unknown Source)
at com.anylogic.engine.Engine.stop(Unknown Source)
at com.anylogic.engine.ExperimentSimulation.stop(Unknown Source)
at com.anylogic.engine.gui.ExperimentHost.executeCommand(Unknown Source)
at com.anylogic.engine.internal.webserver.l.onCommand(Unknown Source)
...
My simulation model looks like this:
Simulation Model with 5 Machines
The model is a simulation of a job-shop scheduling problem and does the following:
1. Generate Job agents through inject(20) in the Source block.
2. The jobs go to the machine defined by a database and wait in the Wait block.
3. The jobs are set free from the Wait block by other agents.
4. The jobs are processed in the Service block.
5. The jobs repeat this process four more times.
There are five agents in total involved in step 3 - let's call them Scheduling Agents - and they use the Wait.free() method to set the Job agents free. One agent controls one Wait block. All five Scheduling Agents work simultaneously and are synchronized through the Main agent (Main notifies the Scheduling Agents). The Hold blocks are unblocked immediately after the simulation starts; they also exist for synchronization purposes. Every Scheduling Agent owns its own thread, which is started via Thread.start() by a timeout event (occurs once, time = 0) defined in Main.
A Scheduling Agent's thread looks something like this:
new Thread(new Runnable() {
    public void run() {
        synchronized (sync_obj) {
            sync_obj.waituntilJobarrives();    // block until a job reaches the Wait block
            sync_obj.Waitblock.free(a_Job);    // release the job into the Service block
            sync_obj.waituntilJobisfinished(); // block until the Service block is done
            repeat();                          // go on with the next job
        }
    }
});
Now here is my problem: When I start the simulation, the jobs are generated normally and move to their assigned Wait block. After that, the Scheduling Agents start their work and free a job, but sometimes a Scheduling Agent calls the Waitblock.free() method and the job is not set free (checked with traceln() when the method was called). To double-check the issue, I implemented buttons that manually call the Waitblock.free() method, but the Job agents still won't leave the Wait block. If a job is not set free by its agent, the simulation of the job shop is stuck there. The simulation keeps running, but the 20 jobs never get finished and no error message is displayed (technically there is no error). Only after stopping the simulation does the error message shown above appear in the console.
What makes matters worse is that this error does not appear every time. Sometimes the simulation works just fine and sometimes a Wait block stops reacting. Usually, after simulating long enough, the error appears and one or several Wait blocks stop reacting.
My guess from reading the error message is that the engine received the order to free the agents from the Wait block but simply never executes it. Can I control (or at least inspect) the order of events scheduled by the engine (Personal Learning Edition)? Or is there another way of fixing the problem?
I am grateful for any help!
EDIT: After removing the Hold blocks, the "Engine still has X events scheduled" error does not appear as often. But the Wait block still does not respond to the Waitblock.free() method, and the following error message appears in the console:
java.lang.RuntimeException: root.w_Warteblock1.readyEntities.output.readyNotificationAsync.event: negative timeout: -1.25
at com.anylogic.engine.Engine.error(Unknown Source)
at com.anylogic.engine.EventOriginator.g(Unknown Source)
at com.anylogic.engine.EventOriginator.c(Unknown Source)
at com.anylogic.engine.EventTimeout.restart(Unknown Source)
at com.anylogic.libraries.processmodeling.AsynchronousExecutor_xjal.a(Unknown Source)
at com.anylogic.libraries.processmodeling.OutputBlock.notifyReady(Unknown Source)
at com.anylogic.libraries.processmodeling.OutputBuffer.a(Unknown Source)
at com.anylogic.libraries.processmodeling.OutputBuffer.take(Unknown Source)
at com.anylogic.libraries.processmodeling.Wait.free(Unknown Source)
This looks more like a common error that I can catch, so my current workaround is a try/catch block around the code in the thread that calls Waitblock.free(), combined with restarting the simulation from the progress saved in an Excel file.
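A rough sketch of that workaround, with placeholder names for the Wait block, the job agent, and the save/restart helper (none of these are confirmed by the model above):

try {
    w_Warteblock1.free(a_Job);                    // may throw the "negative timeout" RuntimeException
} catch (RuntimeException e) {
    traceln("free() failed: " + e.getMessage());  // keep a trace of the failure
    saveProgressToExcel();                        // hypothetical helper that persists the schedule state
    // ...then restart the run from the saved state
}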

I will tell you my thoughts, but the info might not be enough to draw a conclusion:
I remember getting this error when I pause the simulation, then remove an agent, and then stop the simulation. If I follow those steps, I will get that error...
This means that when you stop your simulation, you need to give it at least a millisecond to finish the scheduled events... In this case your scheduled events live on the thread. So a solution would be to stop the simulation with finishSimulation() before you click the stop button. You have to kill the threads before finishSimulation() runs... I'm not sure about this, but give it a try.
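A minimal sketch of what that could look like, assuming Main keeps references to the scheduling threads in a collection called schedulingThreads (that collection is an assumption, not part of the model description):

for (Thread t : schedulingThreads) {
    t.interrupt();     // ask each Scheduling Agent thread to terminate first
}
finishSimulation();    // then end the run cleanly instead of pressing the Stop button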
That's the first problem... The second problem, I think, is related to the Hold block after the Wait block. Notice that if your Hold block is blocked and you try to release more than one agent from the Wait block, only one agent will be freed when you unblock the Hold. This is because there is space for only one agent at the exit of the Wait block... If you make this mistake, the extra agents will stay in the Wait block forever. A solution is to use a Queue block just after the Wait block. I don't think this problem is related to the error you get, though...

I had this problem in my test suites. I fixed it by calling:
engine.finish();
instead of:
engine.stop();
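In a test suite that could look roughly like the following sketch; how the Engine reference is obtained depends on your custom experiment, so treat it as an assumption:

// 'engine' is assumed to be the com.anylogic.engine.Engine instance
// driving the model under test.
engine.finish();   // lets the engine wind down its remaining scheduled events
// engine.stop();  // stopping instead can leave events queued and raise INTERNAL ERROR(S)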

Related

Blocked system-critical thread has been detected

I'm using Ignite.NET 2.7.6. The configuration is one server and about 40 clients. After 8 hours of work, the server starts behaving strangely: clients cannot connect to it, some queries return no result, etc.
On the server side, memory consumption is OK, the thread count is about 250, and everything looks fine. I don't see any obvious problems, so I decided to work through all the server-side issues that were marked as SEVERE.
The first one I encounter is:
Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]
So I want to understand the reason this happens.
Full server's log can be found here:
https://yadi.sk/d/LF03Vz5vz4tRcw
https://yadi.sk/d/MMe0xrgI3k6lkA
Added:
The issue doesn't seem to be innocuous: this message appears every second from various threads, and the "blockedFor" value keeps increasing, from seconds to hours.
The load on the server is low, but as the server's threads become locked, it stops responding and stops registering new clients.
Here are logs from the server:
https://yadi.sk/d/tc3g2hb9B0jtvg
https://yadi.sk/d/05YrlYXcp4xPqg
This is the log from one client:
https://yadi.sk/d/bcbQ7ee4PUzq2w
The client's log's last lines are at 19:03:52, when the server was restarted.
I see the following .NET-specific exception among the others, but it should be triggered by another issue. Anyway, this one has been reported to the community.
class org.apache.ignite.IgniteException: Platform error:System.NullReferenceException: Object reference not set to an instance of an object.
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)
The very first exceptions are related to the communication issues at the networking level. See below:
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was forcibly closed by the remote host]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
It looks like either the server or some clients don't react to heartbeats or other networking requests within 10 seconds. Check the logs of the client nodes as well. You might need to scale out your cluster by adding more servers for the sake of load balancing, or adjust the failureDetectionTimeout.
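If you go the configuration route, the server-side setting lives on IgniteConfiguration; the value below is only an example:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Give slow or overloaded nodes more time to answer heartbeats before
        // they are treated as failed (the log above shows the default of 10 000 ms).
        cfg.setFailureDetectionTimeout(30_000);
        Ignition.start(cfg);
    }
}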
The "Blocked system-critical thread has been detected..." error message itself is innocuous but confusing. I've restarted the following conversation.
As Denis described, there are a lot of network communication issues.
In general, a client would like to perform some cache operation, but a server thread from the striped pool is blocked for a long time. I don't think it relates to the .NET part.
You can see following messages:
[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
If you take a look at the thread:
Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
The thread is trying to send a continuous query callback but is failing to establish a connection to a client node. This causes the thread to be blocked, so it cannot serve other cache API requests that require the same partition.
At first glance, you could try reducing clientFailureDetectionTimeout (the default is 30 seconds). But this won't fix the network issues completely.
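For reference, that property also lives on IgniteConfiguration; the value below is just an example:

IgniteConfiguration cfg = new IgniteConfiguration();
// Drop unreachable clients sooner so striped-pool threads are not blocked for
// the full default of 30 000 ms while sending continuous query callbacks.
cfg.setClientFailureDetectionTimeout(10_000);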

Catch the event of a blocked instance only after a timeout

I have a program where I start several process instances using a cron job. For each process instance I have a maximum time, and if the execution time exceeds it, I have to treat the instance as failed and run some specific methods.
For now, what I did was simply check, once my process instance has finished, whether the elapsed time exceeded the given maximum.
But what if my process instance gets blocked for some reason (e.g. a server not responding)? I need to catch this event and perform the failure operations as soon as the process is blocked and the timeout is exceeded.
How can I catch these two conditions?
I had a look at FlowableEngineEventType, but there isn't a PROCESS_BLOCKED/SUSPENDED type of event. And even if there were, how would I fire it only after a certain amount of time has passed?
I assume that this is the same question as this from the Flowable Forum.
If you are using the Flowable HTTP Task, have a look at the documentation to see how you can set timeouts on it and how you can react to errors there. If you are firing GET requests from your own code, you will need to write your own business logic that throws some kind of BpmnError, which you then handle in your process.
A Flowable process instance does not have the concept of being blocked; you have to model that manually.
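A sketch of the second option - firing the request from your own service-task delegate and converting failures into a BpmnError; the URL, timeout values, and error codes are placeholders:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.flowable.engine.delegate.BpmnError;
import org.flowable.engine.delegate.DelegateExecution;
import org.flowable.engine.delegate.JavaDelegate;

public class CallBackendDelegate implements JavaDelegate {
    @Override
    public void execute(DelegateExecution execution) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://backend.example/status").openConnection();
            conn.setConnectTimeout(5_000);  // fail fast instead of blocking the process instance
            conn.setReadTimeout(5_000);
            if (conn.getResponseCode() >= 400) {
                throw new BpmnError("BACKEND_ERROR");  // caught by a boundary error event
            }
        } catch (IOException e) {
            throw new BpmnError("BACKEND_TIMEOUT");    // routed to the same error handling
        }
    }
}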

VxWorks Mutual exclusion semaphore locked by crashed TASK

I am facing an issue in our C-based application where one of the VxWorks tasks (say TASK1) crashed for unknown reasons. The crashed task had locked a mutual-exclusion semaphore (say semA).
Now TASK2 is waiting on semA to be unlocked. Since semA is locked by a crashed task, TASK2 will wait forever to grab semA. This has broken the application's functionality.
We cannot use a timeout when taking semA in TASK2, because semA protects a send routine that sends data over sockets; a timeout would cause message-communication failures.
After googling I found that Linux has ROBUST mutexes for this kind of problem, but our platform is VxWorks (version 5.5.1).
So can somebody tell me a way to handle this problem in VxWorks?
I have tried the solution below but am not sure how safe it is:
1) TASK2 waits on semA with a particular timeout.
2) If the take fails, it checks the state of the task that last locked semA.
3) If TASK1 is in the SUSPENDED state, TASK2 calls semDelete on semA and then recreates it.
4) If TASK1 is not in the SUSPENDED state, TASK2 keeps waiting to grab semA.
I have tested this code as a prototype and it works fine. I am just not sure how good an idea it is to recreate the semaphore this way and what risks that imposes.
Please let me know your inputs.
Thanks
I think your prototyped solution is no more risky than having code (Task1) that crashes for unknown reasons.
If I were to work on your problem, I would first try really hard to find out why Task1 is crashing. If I were unable to figure out the root cause, I would then implement your proposed solution: query the state of Task1 after a certain amount of time has passed, and then recreate the semaphore.
I must say that even if you implement your workaround of recreating the semaphore, you still have a crashed task consuming resources. If this keeps happening, eventually the whole system will stop working.
In the end, the correct and only way to fix this problem is to fix the crash in Task1. You should be able to get a stack trace of where it crashed and fix it.
I second the previous answers: finding out why Task1 crashes is better than implementing a workaround.
Can you post the messages written by VxWorks for the crashed Task1?
One of the first things I try when a task crashes for no good reason is to increase its stack size (say, double it). If the task then runs fine, the original stack size was too small. Also try increasing the stack size of the task(s) you have modified lately!
If it is a stack problem, it isn't necessarily Task1 that is to blame...

Monitor and handle MSGW messages on a job on an IBM i-series (AS/400) from Java

Does anyone know how one can automatically reply to messages with status MSGW that block a job on an IBM i-series (AS/400)?
I'm using the jt400/jtopen library to access a program on an AS/400 from Java. I'm using the com.ibm.as400.access.ProgramCall class, which works fine, unless the program fails for some reason. As with almost any program, failures will happen sometimes, but unfortunately, in this case, it does not result in a status message or an exception. Instead, the calling thread just hangs. What's worse, any call to the AS/400 to get information on the Job (another class in jt400 that mostly does what you would expect) backing the queue will hang as well.
I could of course monitor the thread in which the call runs and simply kill it after waiting for a while, but that's a last resort. Getting an error message back from the system would be nice.
You could try executing this command, before invoking your PCML, with the com.ibm.as400.access.CommandCall.run() method:
CHGJOB INQMSGRPY(*DFT)
It sets 'C' as the default answer for all messages,
but you should make sure the messages are logged so you can find out which problem generated them.
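A small jt400 sketch of issuing that command (host name and credentials are placeholders):

import com.ibm.as400.access.AS400;
import com.ibm.as400.access.AS400Message;
import com.ibm.as400.access.CommandCall;

public class SetDefaultReply {
    public static void main(String[] args) throws Exception {
        AS400 system = new AS400("AS400HOST", "USER", "PASSWORD");
        CommandCall cmd = new CommandCall(system);
        // Answer inquiry messages with their default reply instead of
        // letting the job fall into MSGW.
        if (!cmd.run("CHGJOB INQMSGRPY(*DFT)")) {
            for (AS400Message msg : cmd.getMessageList()) {
                System.out.println(msg.getID() + ": " + msg.getText());
            }
        }
        system.disconnectAllServices();
    }
}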
Regards,
I don't believe Java can directly trap errors that occur on the other side of that API. What I've done is to 'harden' the RPG (IBM i side) program so that it monitors for errors rather than let the default error handler get them. When an error occurs, the RPG program gracefully terminates and passes back an error code or even the entire message back to the Java application.
I've found that you can use the timeout mechanism of ExecutorService to interrupt a ProgramCall stuck in MSGW.
You must discard the AS400 object afterwards, and the server job is still in MSGW, but at least you can continue on the Java side.
(You need to use a separate AS400 object if you want to investigate the hanging job.)
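A sketch of that approach; the program path, timeout, and host details are placeholders:

import com.ibm.as400.access.AS400;
import com.ibm.as400.access.ProgramCall;
import com.ibm.as400.access.ProgramParameter;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedProgramCall {
    public static void main(String[] args) throws Exception {
        AS400 system = new AS400("AS400HOST", "USER", "PASSWORD");
        ProgramCall pgm = new ProgramCall(system, "/QSYS.LIB/MYLIB.LIB/MYPGM.PGM",
                new ProgramParameter[0]);

        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(pgm::run);
        try {
            System.out.println("Program returned: " + result.get(60, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            result.cancel(true);             // give up on the blocked call
            system.disconnectAllServices();  // discard this AS400 object afterwards
            System.err.println("Call timed out - the server job is probably in MSGW");
        } finally {
            executor.shutdownNow();
        }
    }
}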

Task API - Handling Already Finished Task

I'm making an API and have a function that takes a task and runs it. When the task finishes successfully, its status is set to 'Completed'. Now, let's say the user of the API accidentally (or for whatever reason) sends that same task (or any other already completed task) back into the same function. What should the API do?
1. Throw an exception.
2. Pretend to have rerun the task and tell the user (through events or whatever) that it is done/completed (again).
3. Do nothing and just ignore it.
Is there a standard or best practice for something like this?
Pretending to rerun hides what is probably a user error - this can lead to deadlocks or other logic bugs (e.g. I create an event, wait on it, and run a task that should reset it at some point - that never happens, so we deadlock). Also, done handlers may fail if they are invoked twice for one successful task run.
Doing nothing is more or less the same - done handlers can't fail now :), but they are not invoked at all - a bug is more likely if the done handler performs necessary communication with the spawning thread.
The worst thing is that these may or may not happen, depending on the timing. For example, the task may still be running by the time the user calls the function the second time (what do you do then, by the way?).
So, do throw an exception unless the task status is "not started". The user can always check the status and perform the necessary processing in the unlikely case they need it.
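One way to express that rule, with hypothetical Task and TaskStatus types standing in for whatever the API actually uses:

public class TaskRunner {

    enum TaskStatus { NOT_STARTED, RUNNING, COMPLETED }

    interface Task {
        String getId();
        TaskStatus getStatus();
        void setStatus(TaskStatus status);
        void execute();
    }

    // Reject anything that is not NOT_STARTED instead of silently re-running
    // or ignoring the call; callers who legitimately want to re-submit can
    // check the status themselves first.
    public void run(Task task) {
        if (task.getStatus() != TaskStatus.NOT_STARTED) {
            throw new IllegalStateException(
                    "Task " + task.getId() + " is already " + task.getStatus());
        }
        task.setStatus(TaskStatus.RUNNING);
        task.execute();
        task.setStatus(TaskStatus.COMPLETED);
    }
}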