MPI_Test isn't picking up a completed MPI_Send - interrupt

I've spent three days looking for an answer so I hope you'll bear with me if this has already been addressed and I've been mighty unlucky finding a solution.
I'm using Fortran (eugh!) but this is a generic MPI query.
Scenario (simplified for this example):
Processes 0 and 1 communicate with process 2 (but not with each other)
0 & 1 do lots of sends/receives
2 does lots of receives/process/sends (but each pair is done twice so as to pick up both 0 & 1)
0 & 1 will eventually stop - I know not when! - so I do an MPI_Send from each when appropriate using the rank of the 3rd process (filter_rank_id=2) and a special tag (c_tag_open_rcv=200), with a logical TRUE in the buffer (end_of_run). Like this:
CALL MPI_SEND(end_of_run, 1, MPI_LOGICAL, filter_rank_id, c_tag_open_rcv, mpi_coupling_comms, mpi_err)
The problem arises in process 2... it's busy doing its MPI_Recv/MPI_Send pairs and I cannot break out of it. I have set up a non-blocking receive for each of the other two processes and stored the request handles:
DO model_rank_id = 0, 1
    !Set up a non-blocking receive to get notification of end of model run for each model
    end_run = end_model_runs(model_rank_id) !this is an array of booleans initialised to FALSE
    CALL MPI_IRECV(end_run, 1, MPI_LOGICAL, model_rank_id, &
                   c_tag_open_rcv, coupling_comms, mpi_request_handle, mpi_err)
    !store the handle in an array
    request_handles(model_rank_id) = mpi_request_handle
END DO
where model_rank_id is the process number in the MPI communicator i.e. 0 or 1.
Later on, busy doing all those receive/send pairs, I always check whether anything's arrived in the buffer:
DO model_rank_id = 0, 1
    IF (end_model_runs(model_rank_id) .EQV. .FALSE.) THEN
        CALL MPI_TEST(request_handles(model_rank_id), run_complete, mpi_status, mpi_err)
        IF (run_complete .EQV. .FALSE.) THEN
            !do stuff... receive/process/send
        ELSE
            !run is complete
            !___________removed this as I realised it was incorrect__________
            !get the stop flag for the specific process
            CALL MPI_RECV(end_run, 1, MPI_LOGICAL, model_rank_id, &
                          c_tag_open_rcv, coupling_comms, mpi_err)
            !____________end_________________________________________________
            !store the stop flag so I can do a logical 'AND' on it and break out when
            !both processes have sent their message
            end_model_runs(model_rank_id) = end_run
        END IF
    END IF
END DO
Note that this snippet is contained in a loop which carries on until all the stop flags are TRUE.
I know it's fairly complex, but this can't be that hard, can it? If anyone can see the error that'd be fantastic, or even suggest a better way to do it.
Huge thanks in advance.

Your program is probably stuck in the MPI_RECV call. The reason is that having a positive completion flag as returned by MPI_TEST means that MPI_IRECV has received the message. Unless the sender sends another message with the same tag, MPI_RECV will simply block and wait, in your case probably indefinitely. Apart from that, you are issuing two MPI_IRECV calls with the same receive buffer which is probably not what you really want to do since end_run = end_model_runs(model_rank_id) does not copy the address of the array element into end_run but rather its value.
Your code should look like this:
DO model_rank_id = 0, 1
    !Set up a non-blocking receive to get notification of end of model run for each model
    CALL MPI_IRECV(end_model_runs(model_rank_id), 1, MPI_LOGICAL, model_rank_id, &
                   c_tag_open_rcv, coupling_comms, request_handle, ierr)
    !store the handle in an array
    request_handles(model_rank_id) = request_handle
END DO
...
DO model_rank_id = 0, 1
    IF (end_model_runs(model_rank_id) .EQV. .FALSE.) THEN
        CALL MPI_TEST(request_handles(model_rank_id), run_complete, status, ierr)
        IF (run_complete .EQV. .FALSE.) THEN
            !do stuff... receive/process/send
        ELSE
            !run is complete
            !the stop flag is ALREADY in end_model_runs(model_rank_id)
            !do a logical 'AND' on the flags and break out when
            !both processes have sent their message
        END IF
    END IF
END DO
As a side note, using your own identifiers that start with mpi_ is a terrible idea since those might clash with symbols provided by the MPI library. You should really treat mpi_ as a reserved prefix and never use it while naming your own variables, subroutines, etc. I've fixed that for you in the code above.
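To make the point about completion concrete, here is a minimal Python sketch. It is an analogy rather than MPI code: a background thread stands in for the MPI library's message delivery, PostedReceive is a hypothetical class playing the role of a request handle, and a queue.Queue plays the role of the channel from one sender. It illustrates that once the test reports completion the value is already sitting in the buffer slot, and that a second blocking receive on the same channel finds nothing to read.

```python
import queue
import threading


class PostedReceive:
    """Hypothetical stand-in for a non-blocking receive request (not real MPI)."""

    def __init__(self, channel, buffer, slot):
        self._done = threading.Event()
        # The background thread plays the role of the library delivering the message.
        worker = threading.Thread(
            target=self._deliver, args=(channel, buffer, slot), daemon=True
        )
        worker.start()

    def _deliver(self, channel, buffer, slot):
        buffer[slot] = channel.get()  # consumes the one and only message
        self._done.set()

    def test(self):
        """Return True once the message has been written into the buffer."""
        return self._done.is_set()

    def wait(self, timeout=None):
        return self._done.wait(timeout)


def demo():
    channels = [queue.Queue(), queue.Queue()]
    end_model_runs = [False, False]  # each request writes into its own slot
    requests = [PostedReceive(channels[i], end_model_runs, i) for i in range(2)]

    channels[0].put(True)  # "model 0" sends its single end-of-run message
    requests[0].wait(timeout=5.0)

    # A second blocking receive has nothing left to read; the timeout is only
    # here so that the demo does not hang the way the original program did.
    try:
        channels[0].get(timeout=0.1)
        second_receive_got_message = True
    except queue.Empty:
        second_receive_got_message = False

    return (requests[0].test(), requests[1].test(),
            end_model_runs, second_receive_got_message)
```

Calling demo() reports that request 0 has completed with its flag already stored, request 1 is still pending, and the extra receive came back empty.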

I solved this eventually after a lot of experimentation. It was actually quite simple (isn't it always?)
The problem was that processes 0 & 1 could end and post their "I'm finished" messages OK, but process 2 was in such a tight loop doing the test and recv/send pair (the outer loops on both sets of sends/recvs were omitted for clarity in the original post) that the test would fail and the process would stick in the blocking MPI_RECV.
First I tried a sleep(3), which made it work, but it couldn't sleep on every loop without really bad effects on performance. Then I tried an MPI_IPROBE but hit the same problem as with the test. In the end, a timeout around the MPI_IPROBE did the trick, thus:
DO iter1 = 1, num_models
    !Test each model in turn and ensure we do the comms until it has finished
    IF (end_model_runs(iter1) .EQV. .FALSE.) THEN
        model_rank_id = models(iter1)
        now = TIME()
        DO WHILE (TIME() .LT. now + timeout)
            msg_flag = .FALSE.
            CALL MPI_IPROBE(model_rank_id, c_tag, coupling_comms, &
                            msg_flag, empi_status, empi_err)
            IF (msg_flag .EQV. .TRUE.) THEN
                !Message waiting
                EXIT
            END IF
        END DO
        IF (msg_flag .EQV. .TRUE.) THEN
            CALL MPI_RECV(state_vector, num_state_params, MPI_DOUBLE_PRECISION, &
                          model_rank_id, c_tag, coupling_comms, empi_status, empi_err)
        ELSE !no message waiting, flag should be False, i.e. the run *has* finished
            end_model_runs(iter1) = .NOT. msg_flag
        END IF
    END IF
END DO
and this code sits inside a loop which breaks once all the members of end_model_runs are TRUE.
I hope this helps someone else - and saves them three days of effort!
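The probe-with-deadline pattern used above can be sketched in Python as follows. This is not MPI: a queue.Queue stands in for the incoming message channel, and probe_with_timeout and poll_interval_s are names made up for the illustration. Note that a timeout only says that nothing arrived within the window; reading it as "the sender has finished" is an assumption that holds only if the timeout is comfortably longer than the slowest gap between messages.

```python
import queue
import time


def probe_with_timeout(mailbox, timeout_s, poll_interval_s=0.001):
    """Poll until a message is waiting or the deadline passes.

    Returns True if a message is waiting, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not mailbox.empty():       # non-blocking check, in the spirit of a probe
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)   # yield briefly instead of spinning flat out
```

A caller would follow a True result with a blocking receive, which is then guaranteed to return immediately as long as nobody else reads from the same mailbox.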

Related

How to debug no message matching on assert_receive

I have this test
defmodule InfoSys.Backends.WolframTest do
  use ExUnit.Case, async: true
  alias InfoSys.WolFram
  test "make request, report results, then terminates" do
    ref = make_ref()
    {:ok, pid} = WolFram.start_link("1 + 1", ref, self(), 1)
    assert_receive {:results, ^ref, [%InfoSys.Result{text: "2"}]}
  end
end
and I am receiving
No message matching {:results, ^ref, [%InfoSys.Result{text: "2"}]}
after 100ms.
How can I find out which messages the test process is receiving, or otherwise debug this error? I am following an example from the Programming Phoenix book.
Whole ExUnit stacktrace:
1) test make request, report results, then terminates (InfoSys.Backends.WolframTest)
test/backends/wolfram_test.exs:6
No message matching {:results, ^ref, [%InfoSys.Result{text: "2"}]} after 100ms.
The following variables were pinned:
ref = #Reference<0.214527998.4009754625.219099>
The process mailbox is empty.
code: assert_receive {:results, ^ref, [%InfoSys.Result{text: "2"}]}
stacktrace:
test/backends/wolfram_test.exs:11: (test)
I've run into this kind of problem before. There are two probable causes:
Your process is taking longer than 100 ms to give you the result.
Your process has a bug.
I prefer to rule out the first possibility first, so I make the test wait longer than the default of 100 ms. assert_receive accepts a timeout in milliseconds as its second argument. You can try something like this:
assert_receive {:results, ^ref, [%InfoSys.Result{text: "2"}]}, 5000
where 5000 means 5000 ms or 5 seconds.
After increasing the time, there are two possible outcomes when running a test like this:
Your test mailbox is not empty: your process is slow when processing your requests. Nothing wrong with the logic. Maybe the thing your process is doing is an expensive operation or you need to improve your code's efficiency.
Your test mailbox is empty:
There is definitely a problem with the logic of your process and you should check the code.
You have underestimated the complexity of the operation your process should do and you should find a more efficient way of doing it or change your solution.
I've seen the code you are talking about and it's not too complex (next time you should post it as part of the question), so if you don't receive a response in your test after 5 seconds, then there is definitely something wrong with your process's code.
I hope this helps.

MSMQ queue with multiple processes reading

I had an MSMQ application set up where data was being pushed into one queue. Initially I only had one process reading from it and processing it. Since the volume has increased, I started multiple processes to read from it, each of which is basically a new instance of my original process. I do not see any errors but the performance has really dropped. My understanding is that each process will read from the queue, receive a new message that has not yet been processed, and continue with that. Is this correct, or is it possible that multiple processes could end up processing the same message?
Dim q As MessageQueue
If MessageQueue.Exists(".\private$\MsgsIQueue") Then
    q = New MessageQueue(".\private$\MsgsIQueue")
Else
    'GS - If there is no queue then we're done here
    Console.WriteLine("Queue has not been created!")
    Return
End If
While True
    Dim message As Message
    counter += 1
    Try
        If q.Transactional = True Then
            Thread.Sleep(2000)
        End If
        q.MessageReadPropertyFilter.ArrivedTime = True
        message = q.Peek(TimeSpan.FromSeconds(20.0))
        message.UseJournalQueue = True
        message = q.Receive(New TimeSpan(0, 0, 60))
        message.Formatter = New XmlMessageFormatter(New [String]() {"System.String"})
        ProcessMessage(message)
        ....
OK, are you sure that it is the queue reading that is actually causing the performance degradation? I would suspect that there is some other bottleneck in your pipeline, as MSMQ is really good at handling reads from multiple processes/threads.
Looking at your code, I would suggest the following changes:
Why sleep for 2 seconds if it is a transactional queue? Always use transactional queues, and move the call to Sleep to the catch block so that you have a wait interval if the queue is empty.
Move the setting of the filter outside of the loop.
Remove the call to Peek as it performs nothing of value.
UseJournalQueue is only of use when sending messages, so remove it.
Set the formatter on the queue instead and it will be used for all reads.
You should also wrap the call to Receive and ProcessMessage within a TransactionScope, where you also wrap ProcessMessage in another try/catch block. This way you can commit the read if everything went OK in ProcessMessage, or otherwise choose to abort the read or move the message to a dead letter queue.
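The shape of the loop suggested above can be sketched in Python. This is not MSMQ code and it is not transactional: queue.Queue stands in for the message queue, a plain list stands in for the dead letter queue, and run_consumers and process_message are hypothetical names. It shows several competing consumers each doing a single receive with a timeout (no peek first), handling failures separately from the receive, and each message being handed to exactly one consumer.

```python
import queue
import threading


def run_consumers(messages, num_consumers=3, receive_timeout_s=0.1):
    inbox = queue.Queue()
    for m in messages:
        inbox.put(m)

    processed = []      # (consumer_id, message) pairs
    dead_letters = []   # messages whose processing failed
    lock = threading.Lock()

    def process_message(message):
        # Hypothetical handler: rejects negative numbers to exercise the error path.
        if message < 0:
            raise ValueError("cannot process negative value")
        return message * 2

    def consumer(consumer_id):
        while True:
            try:
                # One receive with a timeout; it removes the message from the
                # queue, so no other consumer can get the same one.
                message = inbox.get(timeout=receive_timeout_s)
            except queue.Empty:
                return  # nothing arrived within the timeout: stop this worker
            try:
                process_message(message)
                with lock:
                    processed.append((consumer_id, message))
            except ValueError:
                with lock:
                    dead_letters.append(message)

    threads = [threading.Thread(target=consumer, args=(i,))
               for i in range(num_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed, dead_letters
```

In a long-running service the worker would wait and retry on an empty queue rather than return; returning here just lets the sketch terminate.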

Does exception handling in Clarion exist?

Does Clarion 8 offer anything for exception handling? I know as of Clarion 5 there was no support for things like try / catch but that was released almost 10 years ago. I can't seem to find any info on how to recover from exceptions in C6 to C8 unless I was using Clarion# (aka Clarion.NET) which I'm not. If there's definitely nothing like try / catch, are there any tricks or hacks that can be used to not have a program crash when an exception is thrown even if it goes unhandled?
If it helps, I'm using version 8.0.0.8778.
EDIT 1:
Here is some sample code for a basic program that should supposedly illustrate the feature PROP:LastChanceHook, however, I can't get it to work. When I run this program, I see the first message "Start", but then nothing happens. I've tried returning 0 or 1 from Hook but that hasn't made a difference either. Every time I run this, I have to go onto the Task Manager and end the process for the program because it's not being killed.
  PROGRAM
  INCLUDE('CWEXCPT.INT'), ONCE
  MAP
    Hook(*ICWExceptionInfo), LONG
    Test(LONG,LONG)
  END
  CODE
  MESSAGE('[Sample] Start')
  SYSTEM{PROP:LastChanceHook} = ADDRESS(Hook)
  Test(10, 0) ! Intentionally causes an exception
  MESSAGE('[Sample] After Test')
  RETURN ! Tried removing this, no difference
Hook PROCEDURE(*ICWExceptionInfo info)
  CODE
  MESSAGE('[Sample] Start Hook')
  IF info &= NULL THEN RETURN 0 END
  Message('An exception!')
  RETURN 1 ! 0 = don't kill, anything > 0 = tell RTL to kill the thread
Test PROCEDURE (LONG a, LONG b)
  CODE
  a %= b
Yes, take a look at prop:LastChanceHook in the help. It may provide enough function for your needs.
In other cases, the info at this link might also be useful:
http://clarionsharp.com/blog/tracking-down-those-pesky-gpfs/
In the next public build of C8 (it's presently Sept 27, 2012), the buttons on that exception display (shown at the link above) can be customized a bit.

multi threading - add more threads and continue the operation

ok here is my code :
For i = 0 To 10
    Dim tTemp As Threading.Thread = New Threading.Thread(AddressOf dwnld)
    tTemp.IsBackground = True
    'tTemp.Start(geturl)
    lThreads.Add(tTemp)
    'MsgBox(lThreads.Item(i).ThreadState)
Next
I create a list of threads with 10 threads, assign them a function, properties and add them to the list.
'While ListBox2.Items.Count > 0
For i = 0 To lThreads.Count - 1
    If (lThreads.Item(i).ThreadState = 12) Then
        If (ListBox2.Items.Count > 0) Then
            lThreads.Item(i).Start(geturl)
            If (i = lThreads.Count - 1) Then
                i = 0
            End If
        Else
            Exit For
        End If
        'MsgBox(lThreads.Item(i).ThreadState)
    ElseIf (lThreads.Item(i).ThreadState = 16) Then
        lThreads.RemoveAt(i)
        Dim tTemp As Threading.Thread = New Threading.Thread(AddressOf dwnld)
        tTemp.IsBackground = True
        lThreads.Add(tTemp)
        If (i = lThreads.Count - 1) Then
            i = 0
        End If
    End If
Next
What's happening is that I see the threads stop after the function dwnld has completed. So I first check the state (12 means background and unstarted). In case 12 I start the thread, and in case 16 (stopped) I remove that particular thread and add a new one, the same way I added the threads above.
There is also a check for when the i counter reaches the last index, which restarts the whole loop by assigning i = 0.
The program downloads some web pages; the URL is passed from ListBox2. geturl returns the URL and removes it from the list. So when the list box is empty, the For loop exits.
But the above code runs only 11 times and does not restart. I tried using a label and GoTo, but it simply hangs.
Can anyone tell me what to do?
What I want is to maintain 10 threads that keep downloading the web pages and, when the list is empty, exit the function.
Trying to manually manage your own custom pool of threads is probably the wrong approach here. Use ThreadPool.QueueUserWorkItem or preferably the new Task class. The thread pooling is managed for you, which greatly simplifies the code. Completely scrap this code and start over using one of the techniques I just mentioned. If you run into problems implementing either of these techniques, then post a more specific question.
Micro-management of threads is, well, just a really bad idea. The moment I see anyone trying to maintain a list of threads that are continually created, terminated and destroyed I just know they are doomed. I have seen experienced professionals trying to do it - it's fun looking on, waiting for the inevitable spectacular failure after months of trying to fix the unfixable.
Thread pools are, typically, nothing of the sort. They are usually a pool of tasks - task class instances on a producer-consumer queue - that several threads feed off as and when they are free to do work. The work threads auto-manage themselves by getting new tasks themselves when they have finished with the old one - no need for any higher-level micro management.
Listen to @Brian - forget managing lists of threads, checking their state and all that gunge. It'll just make you ill. Go with ThreadPool.QUWI or Tasks.
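For comparison, this is roughly what the "queue the work and let the pool manage the threads" idea looks like. The sketch uses Python's concurrent.futures rather than the .NET ThreadPool or Task classes named in the answers, and download and download_all are placeholder names; the real download logic would replace the body of download.

```python
from concurrent.futures import ThreadPoolExecutor


def download(url):
    # Placeholder for the real download; returns a fake page body.
    return f"<html>{url}</html>"


def download_all(urls, max_workers=10):
    """Download every URL using at most max_workers threads at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() queues one task per URL; whichever worker is idle takes the
        # next one, and results come back in the same order as the input.
        return list(pool.map(download, urls))
```

There is no thread list to inspect and no state codes to compare: the function returns once the URL list has been exhausted and every task has finished.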

Handling Lua errors in a clean and effective manner

I'm currently trying to code an add-on for the popular game World of Warcraft for a friend. I don't understand too much about the game myself, and debugging it within the game is difficult as he's having to do all the testing.
I'm pretty new to Lua, so this may be a very easy question to answer. When a Lua error occurs in WoW, it is thrown on screen and gets in the way. This is very bad for a player, as it will interrupt their gameplay if the exception is thrown at the wrong time. I'm looking for a way to cleanly handle the error being thrown. Here's my code so far for the function.
function GuildShoppingList:gslSlashProc()
    -- Actions to be taken when command /gsl is procced.
    BankTab = GetCurrentGuildBankTab()
    BankInfo = GetGuildBankText(BankTab)
    local Tabname, Tabicon, TabisViewable, TabcanDeposit, TabnumWithdrawals, remainingWithdrawals = GetGuildBankTabInfo(BankTab)
    p1 = BankInfo:match('%-%- GSL %-%-%s+(.*)%s+%-%- ENDGSL %-%-')
    if p1 == nil then
        self:Print("GSL could not retrieve information, please open the guild bank and select the info tab allow data collection to be made")
    else
        self:Print("Returning info for: "..Tabname)
        for id,qty in p1:gmatch('(%d+):(%d+)') do
            --do something with those keys:
            local sName, sLink, iRarity, iLevel, iMinLevel, sType, sSubType, iStackCount = GetItemInfo(id);
            local iSum = qty/iStackCount
            self:Print("We need "..sLink.." x"..qty.."("..iSum.." stacks of "..iStackCount..")")
        end
    end
end
The problem is that, even with the check for p1 being nil, it still throws a Lua error about trying to call p1 as nil. It will be nil at times and this needs to be handled correctly.
What would be the correct, most efficient way to go about this?
You might want to wrap your function in a pcall or xpcall, which enables you to intercept any error thrown by Lua.
Aside from that, I personally find this construct easier to read:
p1 = string.match(str, pat)
if p1 then
    -- p1 is valid, eg not nil or false
else
    -- handle the problems
end