Hangfire Dispatcher stopped due to an exception - Azure Cloud MVC website - hangfire

We get these Hangfire error messages about 2 or 3 times a day at random times, even at night when there is no activity happening. These messages come at different times; they seem to be independent.
Execution Worker is in the Failed state now due to an exception,
execution will be retried no more than in 00:00:04
Dispatcher is stopped due to an exception, you need to restart the
server manually. Please report it to Hangfire developers.
We have Hangfire running in a website hosted in the Azure Cloud, and we have the “Always On” setting turned on.
It is an ASP.NET 4.61 MVC website.
Here is our configuration in Global.asax.cs:
HangfireAspNet.Use(GetHangfireServers);
.
.
.
private IEnumerable GetHangfireServers()
{
Hangfire.GlobalConfiguration.Configuration
.SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
.UseSimpleAssemblyNameTypeSerializer()
.UseRecommendedSerializerSettings()
.UseNinjectActivator(_kernel)
.UseNLogLogProvider()
.UseSqlServerStorage(“SQL connection string”, new SqlServerStorageOptions
{
CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
QueuePollInterval = TimeSpan.Zero,
UseRecommendedIsolationLevel = true,
DisableGlobalLocks = true
});
yield return new BackgroundJobServer();
}
We have this in the Owin Startup.cs file:
app.UseHangfireDashboard("/hangfire", new DashboardOptions
{
Authorization = Enumerable.Empty()
});
We have these packages installed:
Hangfire.AspNet version=0.2.0
Hangfire.Core version=1.7.28
Hangfire.Ninject version=1.2.0
Hangfire.SqlServer version=1.7.28
Our website (hosted in the Azure Cloud) has 5 webservers that are load-balanced, so we normally have 5 Hangfire servers running.
Does anyone have ideas on how we can solve this? Or any ideas on how we can troubleshoot this?
Thank You
Update
Here is the inner exception for the "Execution Worker" error.
Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04 System.Data.SqlClient.SqlException Login failed for user '...'. Token is expired.
A severe error occurred on the current command. The results, if any, should be discarded.
Any ideas on how we can refresh the Token? Maybe it expires if there is no activity? So maybe we need to have a scheduled job that will refresh the token?
The "Dispatcher Error" stopped happening. I think it may be related to the "ghost servers" that #jbl mentioned below. I'm not sure, but I think we got ghost servers during the deployment process. We deploy new code to a "deployment slot", and then the deployment slot swaps into the production slot. Somewhere in the deployment process some hangfire servers were not shutdown properly and they became ghost servers (that's my guess). What seemed to solve the issue is that we restart both slots after a deployment to make sure all the hangfire servers are shutdown.
Thank You

Related

Intermittent problems starting Azure App Services: "500.37 ANCM Failed to Start Within Startup Time Limit"

Our app services are experiencing the problem, that they can’t be restarted by the hosting environment (ANCM).
The user is getting the following screen in that case:
Http Error 500.37
Our production subscription consists of up to 8 different app services and the problem can randomly harm one of them ore some of them.
The problem can occur several times a week, or just once a month.
The bootstrapping procedure of our app services is not time consuming.
The last occurrence of the problem has this log entries within the eventlog:
Failed to gracefully shutdown application 'MACHINE/WEBROOT/APPHOST/XXXXXXXXX'.
followed by:
Application '/LM/W3SVC/815681839/ROOT' with physical root 'D:\home\site\wwwroot' failed to load coreclr. Exception message: Managed server didn't initialize after 120000 ms
In most cases the problem can be resolved by manually stopping and starting the app service. In some cases we had to do that twice.
We are not able to reproduce that behavior locally.
The App Service Plan is S2 and we actually use just one instance.
The documentation of the Http error 500.37 recommends:
"You may need to stagger the startup process of multiple apps."
But there is no hint of how to do that.
How can we ensure that our app services are restarted without errors.
HTTP Error 500.37 - ANCM Failed to Start Within Startup Time Limit
You can try following approaches:
Approach 1: If possible, can try to move one app into a new App Service with a separate App Service plan, then check whether it can start as expected.
Please note that creating and using a separate App Service plan would be charged.
Approach 2: Increasing the startupTimeLimit attribute of the aspNetCore element.
For more information about the startupTimeLimit attribute, please check: https://learn.microsoft.com/en-us/aspnet/core/host-and-deploy/aspnet-core-module?view=aspnetcore-3.1#attributes-of-the-aspnetcore-element

Function Timeout Issue when using app service plan

I have set up azure function app in app service plan. The azure function is made of of durable function with multiple activity triggers. When the data is huge, the last activity trigger times out with 404 gateway timeout error, but when I check the called method, it is still getting executing.
I read that app service plan has infinite timeout, even after that I added the functiontimeout setting in host.json.
Still no help. And the process keeps failing with gateway timeout error.

Websphere application server LDAP connection pool

We are using websphere application server 8.5.0.0. we have a requirement where we have to query a LDAP server to get the customer details. I tried to configure the connection pool as described here and here.
I passed the below JVM arguments
-Dcom.sun.jndi.ldap.connect.pool.maxsize=5
-Dcom.sun.jndi.ldap.connect.pool.timeout=60000
-Dcom.sun.jndi.ldap.connect.pool.debug=all
Below is a sample code snippet
Hashtable<String,String> env = new Hashtable<String,String>();
...
...
env.put("com.sun.jndi.ldap.connect.pool", "true");
env.put("com.sun.jndi.ldap.connect.timeout", "5000");
InitialDirContext c = new InitialDirContext(env);
...
...
c.close();
I have two issues here
When I am calling the service for the 6th time, I am getting javax.naming.ConnectionException: Timeout exceeded while waiting for a connection: 5000ms. I checked the connection pool debug logs and I noticed the connections are not returning back to the pool immediately despite closing the context safely in a finally block. The connections are released after some time and expired after sometime after the release. There after if I call the service again, it connects to the LDAP server but new connections are being created.
I tried to execute the code and I am able to see the connection pool debug logs. But the logs are being logged in System.Err log. Is this an issue? Can I ignore it?
But when I run the code as a standalone application(multithreaded with loop of 50 times), the connections are returned/released immediately.
Can anyone please let me know what am I doing wrong?

Topshelf Windows Service times out Error 7000 7009

I have a windows service programmed in vb.NET, using Topshelf as Service Host.
Once in a while the service doesn't start. On the event log, the SCM writes errors 7000 and 7009 (service did not respond in a timely fashion). I know this is a common issue, but I (think) I have tried everything with no result.
The service only relies in WMI, and has no time-consuming operations.
I read this question (Error 1053: the service did not respond to the start or control request in a timely fashion), but none of the answers worked for me.
I Tried:
Set topshelf's start timeout.
Request additional time in the first line of "OnStart" method.
Set a periodic timer wich request additional time to the SCM.
Remove TopShelf and make the service with the Visual Studio Service Template.
Move the initialization code and "OnStart" code to a new thread to return inmediately.
Build in RELEASE mode.
Set GeneratePublisherEvidence = false in the app.config file (per application).
Unchecked "Check for publisher’s certificate revocation" in the internet settings (per machine).
Deleted all Alternate Streams (in case some dll was marked as web and blocked).
Removed any "Debug code"
Increased Window's general service timeout to 120000ms.
Also:
The service doesn't try to communicate with the user's desktop in any way.
The UAC is disabled.
The Service Runs on LOCAL SYSTEM ACCOUNT.
I believe that the code of the service itself is not the problem because:
It has been on production for over two years.
Usually the service starts fine.
There is no exception logged in the Event Log.
The "On Error" options for the service dosn't get called (since the service doesn't actually fails, just doesn't respond to the SCM)
I've commented out almost everything on it, pursuing this error! ;-)
Any help is welcome since i'm completely out of ideas, and i've been strugling with this for over 15 days...
For me the 7009 error was produced by my NET core app because I was using this construct:
var builder = new ConfigurationBuilder()
.SetBasePath(Directory.GetCurrentDirectory())
.AddJsonFile("appsettings.json");
and appsettings.json file obviously couldn't be found in C:\WINDOWS\system32.. anyway, changing it to Path.Combine(AppContext.BaseDirectory, "appsettings.json") solved the issue.
More general help - for Topshelf you can add custom exception handling where I finally found some meaningfull error info, unlike event viewer:
HostFactory.Run(x => {
...
x.OnException(e =>
{
using (var fs = new StreamWriter(#"C:\log.txt"))
{
fs.WriteLine(e.ToString());
}
});
});
I've hit the 7000 and 7009 issue, which fails straight away (even though the error message says A timeout was reached (30000 milliseconds)) because of misconfiguration between TopShelf and what the service gets installed as.
The bottom line - what you pass in HostConfigurator.SetServiceName(name) needs to match exactly the SERVICE_NAME of the Windows service which gets installed.
If they don't match it'll fail straight away and you get the two event log messages.
I had this start happening to a service after Windows Creator's Edition update installed. Basically it made the whole computer slower, which is what I think triggered the problem. Even one of the Windows services had a timeout issue.
What I learned online is that the constructor for the service needs to be fast, but OnStart has more leeway with the SCM. My service had a C# wrapper and it included an InitializeComponent() that was called in the constructor. I moved that call to OnStart and the problem went away.

Task Persistence C#

I'm having a hard time trying to get my task to stay persistent and run indefinitely from a WCF service. I may be doing this the wrong way and am willing to take suggestions.
I have a task that starts to process any incoming requests that are dropped into a BlockingCollection. From what I understand, the GetConsumingEnumerable() method is supposed to allow me to persistently pull data as it arrives. It works with no problem by itself. I was able to process dozens of requests without a single error or flaw using a windows form to fill out the request and submit them. Once I was confident in this process I wired it up to my site via an asmx web service and used jQuery ajax calls to submit request.
The site submits request based on a url that is submitted, the Web Service downloads the html content from the url and looks for other urls within the content. It then proceeds to create a request for each url it finds and submits it to the BlockingCollection. Within the WCF service, if the application is Online (i.e. Task has started) - it pulls the request using the GetConsumingEnumerable via a Parallel.ForEach and Processes the request.
This works for the first few submissions, but then the task just stops unexpectedly. Of course, this is doing 10x more request than I could simulate in testing - but I expected it to just throttle. I believe the issue is in my method that starts the task:
public void Start()
{
Online = true;
Task.Factory.StartNew(() =>
{
tokenSource = new CancellationTokenSource();
CancellationToken token = tokenSource.Token;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 20;
options.CancellationToken = token;
try
{
Parallel.ForEach(FixedWidthQueue.GetConsumingEnumerable(token), options, (request) =>
{
Process(request);
options.CancellationToken.ThrowIfCancellationRequested();
});
}
catch (OperationCanceledException e)
{
Console.WriteLine(e.Message);
return;
}
}, TaskCreationOptions.LongRunning);
}
I've thought about moving this into a WF4 Service and just wire it up in a Workflow and use Workflow Persistence, but am not willing to learn WF4 unless necessary. Please let me know if more information is needed.
The code you have shown is correct by itself.
However there are a few things that can go wrong:
If an exception occurs, your task stops (of course). Try adding a try-catch and log the exception.
If you start worker threads in a hosted environment (ASP.NET, WCF, SQL Server) the host can decide arbitrarily (without reason) to shut down any worker process. For example, if your ASP.NET site is inactive for some time the app is shut down. The hosts that I just mentioned are not made to have custom threads running. Probably, you will have more success using a dedicated application (.exe) or even a Windows Service.
It turns out the cause of this issue was with the WCF Binding Configuration. The task suddenly stopped becasue the WCF killed the connection due to a open timeout. The open timeout setting is the time that a request will wait for the service to open a connection before timing out. In certain situations, it reached the limit of 10 max connection and caused the incomming connections to get backed up waiting for a connection. I made sure that I closed all connections to the host after the transactions were complete - so I gave in to upping the max connections and the open timeout period. After this - it ran flawlessly.