How to fix a stalled archival process with ServicePulse? - nservicebus

In my environment I am running ServicePulse v1.14.4 and ServiceControl v3.3.0. I periodically go in to check for error messages, and there are in excess of 280K of them. So I start the archive process and hope they get archived. After 4 days the process is still going, or seems to be hung up at 66%.
My question is: is there any way to get this back on track without having to restart the services?

Related

When I start my SAP MMC EC6 server, one service is not getting to wait mode

Can someone help me with how to get the service selected in the image into wait mode after starting the server?
Please let me know if a developer trace needs to be posted to resolve this issue.
That particular process is a BATCH process, a process that runs scheduled background tasks (maintained by transactions SM36/SM37). If the process is busy right after starting the server, that means there were scheduled tasks with status 'released' waiting for execution, and as soon as the server was up, it started those tasks.
If you want to make sure the system doesn't immediately start released background tasks, you'll have to set the status back to 'scheduled' (which, thanks to a bit of odd translation, means they won't be executed because they are not released).
If you want to start the server without having a chance to first change the job status in SM37, you would either have to reset the status at the database level (likely not officially supported by SAP) or first start the server without any BATCH processes (which will produce a number of big warning messages upon login), change the job status, and then restart the server with the BATCH processes. You can set the number of processes of each type in the profile of your instance (parameter rdisp/wp_no_btc), as sketched below.
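For illustration only, a minimal sketch of that last option: starting without BATCH work processes is a one-line change in the instance profile (the value 0 is an assumption for this temporary scenario; restore the original number afterwards).
rdisp/wp_no_btc = 0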

Spring Batch restart crashed jobs

Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using:
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution jobExecution : jobExecutions) {
    // Mark the stale execution as failed so the framework will allow a restart.
    jobExecution.setStatus(BatchStatus.FAILED);
    jobExecution.setEndTime(new Date());
    jobRepository.update(jobExecution);
    jobOperator.restart(jobExecution.getId());
}
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs can be started.
2) I have to handle multiple instances of running servers so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution that registers a "start up clean jobs listener". This would still not fix the problems originating from a multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is actually running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't up the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution and fail with the message: "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
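To illustrate what that protection looks like from the launching side, here is a rough sketch; the jobLauncher, job, and jobParameters names are placeholders rather than anything from the original post:

try {
    jobLauncher.run(job, jobParameters);
} catch (JobExecutionAlreadyRunningException e) {
    // The previous execution still looks STARTED in the JobRepository, so
    // Spring Batch refuses to launch a duplicate; inspect that execution
    // before forcing a restart.
} catch (JobExecutionException e) {
    // Other launch failures (restart not allowed, instance already complete, invalid parameters).
}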
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop via throwing an Exception
Stop via JobOperator.stop()
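As a minimal sketch of the JobOperator.stop() route (the job name is a placeholder, and checked-exception handling is omitted as in the snippet above):

// Ask the framework to stop each running execution gracefully: the status is
// set to STOPPING and the step exits at the next chunk boundary.
Set<Long> runningIds = jobOperator.getRunningExecutions("myJob");
for (Long executionId : runningIds) {
    jobOperator.stop(executionId);
}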

Hangfire 1.3.4 - deleted jobs stuck in queue

We are running Hangfire single threaded using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requirement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (e.g., the actual code it is running goes into an infinite loop), and because we are running single threaded this prevents other jobs in the queue from starting.
In order to try and work around this, we delete the job, but then it stays on the queue, blocking any other job from starting:
The only way I have found to resolve this is to drop and recreate the hangfire DB which is obviously not great.
Why does deleting a running job in Hangfire not also remove it from the queue? Is this weird delete behavior a bug that will be fixed in a later version, or is this behavior by design because we're running single threaded?
If this is by design then how do you cancel a processing job in a way which removes it from the queue?
Well it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. However, without this restart Hangfire will "hang" indefinitely.
An issue was raised on GitHub about this; however, it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes hangfire a lot less useful in a single threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.

Does firing off an indefinite WAITFOR increase the log file size?

In the last release of my app, I added a command that tells it to wait until something arrives in the Service Broker queue:
WAITFOR (RECEIVE CONVERT(int, message_body) AS Message FROM MyQueue)
The DBAs tell me that since the addition, the log sizes have gone through the roof. Could this be correct? Or should I be looking elsewhere?
I haven't tested this in Service Broker, but I assume the same ACID compliance mechanisms would be in play. It would depend on whether your code leaves a transaction open. If it is leaving a transaction open and not committing it, the log will continue to grow until something closes it, and only at that point will it finally mark the old areas for re-use.
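For what it's worth, here is a minimal sketch of the pattern I'd expect to keep the transaction (and therefore the log) in check; it reuses the queue and column from your post, and the 5-second TIMEOUT is just an example value:

BEGIN TRANSACTION;
-- Bound the wait instead of blocking indefinitely; RECEIVE returns an empty
-- result set if nothing arrives within the timeout.
WAITFOR (
    RECEIVE TOP (1) CONVERT(int, message_body) AS Message
    FROM MyQueue
), TIMEOUT 5000;
-- ... process the message here ...
COMMIT TRANSACTION; -- commit promptly so the old log areas can be marked for re-use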
I haven't rolled Service Broker out in prod yet, but the testing/reading I did did not include any WAITFOR.
Instead, Service Broker MVPs like Denny Cherry would typically keep querying the queue instead of doing a WAITFOR.
Can you post some of the other code and also tell us why you're using WAITFOR? Maybe there's something I'm not getting that would be a good use case scenario. Thanks!

Worker process reached its allowed processing time

We are experiencing this issue approximately once a month. It is very hard to pinpoint the cause, so any help would be appreciated. This causes the app pool to stop and brings the site down. We have gone through all the log files and found nothing conclusive. We are using version 2.0.3 on IIS 6.
I've noticed IIS defaults web apps to a 29-hour recycle schedule, which can be troublesome since it may recycle at times your users do not expect.
For example: the web app starts at 12am, which means the next day it recycles at 5am, the day after that at 10am, the day after that at 3pm, etc. (this assumes there is enough request activity against your app to keep it alive so it does not shut down due to inactivity).
If your web app relies heavily on in-memory session state this is especially bad, because the recycle will kill sessions and possibly force users to re-authenticate and lose any unsaved work (if you don't design your app to work seamlessly with recycling).
Check the recycle schedule and make sure it recycles at a time that you expect. See this for screenshots: http://remy.supertext.ch/2010/08/iis7-worker-process-reached-its-allowed-processing-time-limit/
Not sure about the infinite loop suggestion... sounds like you just have a recycling configuration issue to resolve.
This likely indicates an infinite loop in your application code.
Basically, every time a request comes into the web server, IIS hands the request off to a worker process. You can configure in IIS how many of those workers there are, and what the timeout value is. The timeout is to keep things moving in case the application code hangs -- it gets killed so the thread can go back in the pool to keep servicing new requests.
So look through your code for likely infinite loops. Or alternatively, it could be an extremely long-running database query that could have eventually finished but exceeded the timeout value. Perhaps your web application offers the end user an opportunity to make too broad of a query that returns too much data or requires too much DB processing time.
It's hard to give a specific cause for you, of course, but try to think along these lines.
If you're experiencing a crash as a result (sounds like you are) then you might want to grab a copy of Debugging Tools for Windows and spend some time reading Tess Ferrandez' blog--she offers great advice on performing post mortem crash analysis and makes WinDbg a whole lot more approachable.