Is there a way to recover from an unhandled exception that doesn't involve cancelling, terminating, or aborting a Workflow?
What I'd like to do is have the Workflow restart or simply log the exception if possible. My workflow is long running and hosted in a WorkflowApplication, which is in a Windows Service.
As of right now, if unhandled exception are experienced, the service is in the "started" state but my workflow is dead in the water and I'd like to possibly "kick-start" the workflow back into action, even if it has to completely restart its sequence.
Is compensation desirable in this scenario?
If you use workflow persistence and abort the workflow will be able to restart from the last saved state in the store. Adding Persist activities in strategic places in your workflow will make sure that the past saved state is a god point to restart.
Note that with the WorkflowApplication as a host you have to reload it yourself. The best way would be to add a callback to the Aborted property that is fired when the workflow aborts. There you create a new WorkflowApplication and load the same workflow instance and resume it.
Related
I have a WCF application which consists in some async communications with ecternal services. When we start a new expedient, a new instance is created; it process data and send an xml to a external service and waits for the response. This response requires that a person review the xml and send the response so it usually it is delayed for a long time. For this reason, the workflow go to idle and we use persistence with AppFabric.
The fact is that sometime, when we receive the response, the next exception is raised:
The execution of the InstancePersistenceCommand named {urn:schemas-microsoft-com:System.Activities.Persistence/command}LoadWorkflowByInstanceKey was interrupted by an error.
Normally this error does not occur, it can occur very sporadically. However, we are trying to update the app to include a new functionality (it does not modify the workflow) but when the application is deployed to the server, the instances that were created with the old deployment and were waiting for the response, throw this exception when they receive the response from the external service. However, the instances initiated with the new deployment process the response without problem.
I have been looking for information about this problem but I haven't found much. Anybody can help me?
SOLUTION:
Thanks a lot for your answer, it may be helpful for me in the future. In this case, the problem was that I was updating an assembly version of one of the implicated project (to upload a nuget package) and for a reason that I don’t understand, the instances created with an old version raised this exception when the service with the new version had to manipulate the mentioned instances.
If I change the assembly version to upload the nuget and then set the original version and deploy with this version, everything works ok. Anybody knows what is the reason?
Thanks a lot.
This may be because there is a program running in the background and trying to extend the lock on the instance store every 30 seconds, and it seems that whenever the connection to the SQL service fails, it marks the instance store as invalid.
You can try <workflowIdle timeToUnload="0"/>, if it doesn't work you can look at the methods provided by other links.
Windows workflow 4.0 InstancePersistenceCommand Error
Why do I get exception "The execution of the InstancePersistenceCommand named LoadWorkflowByInstanceKey was interrupted by an error"
WF4 InstancePersistenceCommand interrupted
Our app services are experiencing the problem, that they can’t be restarted by the hosting environment (ANCM).
The user is getting the following screen in that case:
Http Error 500.37
Our production subscription consists of up to 8 different app services and the problem can randomly harm one of them ore some of them.
The problem can occur several times a week, or just once a month.
The bootstrapping procedure of our app services is not time consuming.
The last occurrence of the problem has this log entries within the eventlog:
Failed to gracefully shutdown application 'MACHINE/WEBROOT/APPHOST/XXXXXXXXX'.
followed by:
Application '/LM/W3SVC/815681839/ROOT' with physical root 'D:\home\site\wwwroot' failed to load coreclr. Exception message: Managed server didn't initialize after 120000 ms
In most cases the problem can be resolved by manually stopping and starting the app service. In some cases we had to do that twice.
We are not able to reproduce that behavior locally.
The App Service Plan is S2 and we actually use just one instance.
The documentation of the Http error 500.37 recommends:
"You may need to stagger the startup process of multiple apps."
But there is no hint of how to do that.
How can we ensure that our app services are restarted without errors.
HTTP Error 500.37 - ANCM Failed to Start Within Startup Time Limit
You can try following approaches:
Approach 1: If possible, can try to move one app into a new App Service with a separate App Service plan, then check whether it can start as expected.
Please note that creating and using a separate App Service plan would be charged.
Approach 2: Increasing the startupTimeLimit attribute of the aspNetCore element.
For more information about the startupTimeLimit attribute, please check: https://learn.microsoft.com/en-us/aspnet/core/host-and-deploy/aspnet-core-module?view=aspnetcore-3.1#attributes-of-the-aspnetcore-element
Our web app process is restarting regularly and we are unable to determine the reason.
When looking into Application Events (using the 'Diagnostics and solve problems' blade in the Azure Portal), there exists a bunch of the following Info logs by 'IIS AspNetCore Module'
Event ID 1005:
Failed to gracefully shutdown process '14040'.
Event ID 1001:
Application 'MACHINE/WEBROOT/APPHOST/myapplication__xxxx' started process '31628' successfully and is listening on port '17663'.
There is nothing fishy with general resource usage and nothing in our application logs.
What is the best way to troubleshoot the reason behind these process restarts?
EDIT 1:
After fiddling around with web logging in the Web App's Diagnostic Logs, I now get an error logged from W3SVC-WP after each restart, but the message is nonsense:
1<br/>5<br/>50000780
EDIT 2:
Event Id 2284 refers to this:
FailedRequestTracing module failed to write buffered events to log
file for the request that matched failure definition. No logs will be
generated until this condition is corrected. The problem happened at
least %1 times in the last %2 minutes. The data is the error.
I'm not sure if this could be related to our Diagnostic Logs configuration, but seems unlikely.
EDIT 3:
As per Brando Zhang's suggestion, I've used the Web App Crash Diagnoser extension and tried monitoring 2nd Chance Unhandled Exceptions on both my application process AND on w3wp, but nothing is dumped.
From how I understand it, 1st Chance Exceptions will not crash the process, so no need to monitor these.
Very likely application is crashing due to fatal exception and causing the restarts.
On Azure App Service platform.You can use the Diagnostics as a
Service (DaaS) to troubleshoot this
It can also do an analysis and tell you the root cause most of the time.More step by step infofrmation can be found on this msdn blog .Also refer tips for using crash diagnoser
I have a windows service programmed in vb.NET, using Topshelf as Service Host.
Once in a while the service doesn't start. On the event log, the SCM writes errors 7000 and 7009 (service did not respond in a timely fashion). I know this is a common issue, but I (think) I have tried everything with no result.
The service only relies in WMI, and has no time-consuming operations.
I read this question (Error 1053: the service did not respond to the start or control request in a timely fashion), but none of the answers worked for me.
I Tried:
Set topshelf's start timeout.
Request additional time in the first line of "OnStart" method.
Set a periodic timer wich request additional time to the SCM.
Remove TopShelf and make the service with the Visual Studio Service Template.
Move the initialization code and "OnStart" code to a new thread to return inmediately.
Build in RELEASE mode.
Set GeneratePublisherEvidence = false in the app.config file (per application).
Unchecked "Check for publisher’s certificate revocation" in the internet settings (per machine).
Deleted all Alternate Streams (in case some dll was marked as web and blocked).
Removed any "Debug code"
Increased Window's general service timeout to 120000ms.
Also:
The service doesn't try to communicate with the user's desktop in any way.
The UAC is disabled.
The Service Runs on LOCAL SYSTEM ACCOUNT.
I believe that the code of the service itself is not the problem because:
It has been on production for over two years.
Usually the service starts fine.
There is no exception logged in the Event Log.
The "On Error" options for the service dosn't get called (since the service doesn't actually fails, just doesn't respond to the SCM)
I've commented out almost everything on it, pursuing this error! ;-)
Any help is welcome since i'm completely out of ideas, and i've been strugling with this for over 15 days...
For me the 7009 error was produced by my NET core app because I was using this construct:
var builder = new ConfigurationBuilder()
.SetBasePath(Directory.GetCurrentDirectory())
.AddJsonFile("appsettings.json");
and appsettings.json file obviously couldn't be found in C:\WINDOWS\system32.. anyway, changing it to Path.Combine(AppContext.BaseDirectory, "appsettings.json") solved the issue.
More general help - for Topshelf you can add custom exception handling where I finally found some meaningfull error info, unlike event viewer:
HostFactory.Run(x => {
...
x.OnException(e =>
{
using (var fs = new StreamWriter(#"C:\log.txt"))
{
fs.WriteLine(e.ToString());
}
});
});
I've hit the 7000 and 7009 issue, which fails straight away (even though the error message says A timeout was reached (30000 milliseconds)) because of misconfiguration between TopShelf and what the service gets installed as.
The bottom line - what you pass in HostConfigurator.SetServiceName(name) needs to match exactly the SERVICE_NAME of the Windows service which gets installed.
If they don't match it'll fail straight away and you get the two event log messages.
I had this start happening to a service after Windows Creator's Edition update installed. Basically it made the whole computer slower, which is what I think triggered the problem. Even one of the Windows services had a timeout issue.
What I learned online is that the constructor for the service needs to be fast, but OnStart has more leeway with the SCM. My service had a C# wrapper and it included an InitializeComponent() that was called in the constructor. I moved that call to OnStart and the problem went away.
I have a service that scans network folders using a parallel.for method.
However recently I am finding if I stop the service then while windows says the service is stopped the process is still running in task manager. However it is at 0 cpu and the memory does not change. If I try and end the task (even a force in command prompt) it just says access denied and i have to reboot the server.
What would be the best way to make sure everything terminates?
I thought of adding a global Boolean that in the stop procedure it turns true and part of my parallel code will check for that and call s.stop.
Thank you
In brief, when your service is stopped, it needs to cancel all pending and running operations, then wait for those operation to actually finish. Check out the MSDN reference for Task Cancellation.