Recently one issue has really been challenging me; I have already spent almost 15 days trying to figure out the root cause, but unfortunately, so far no luck.
Here are the details.
As part of a feature we provide in our application, customers can queue up a list of jobs, let them run in the background, and be notified once the jobs they queued have finished.
Each job is supposed to execute ~100K MDX queries to complete successfully. Behind the scenes, our engine divides those 100K queries into smaller chunks and creates a job for each chunk with a smaller number of queries. In this case each small job deals with 1000 queries, so with these rough numbers the engine creates 100 additional jobs. The engine then executes those small chunks one by one.
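For illustration only, here is a rough, hypothetical sketch of that chunking; the engine's real code is not shown, and allQueries, LoadMdxQueries, and the chunk size are assumptions:
using System.Collections.Generic;
using System.Linq;

// ...
List<string> allQueries = LoadMdxQueries();   // ~100K MDX query strings (assumed helper)
const int chunkSize = 1000;

List<List<string>> chunks = allQueries
    .Select((query, index) => new { query, index })
    .GroupBy(x => x.index / chunkSize)        // queries 0..999 -> chunk 0, 1000..1999 -> chunk 1, ...
    .Select(g => g.Select(x => x.query).ToList())
    .ToList();

// Each chunk then becomes one small background job that runs its ~1000 queries.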
In each job execution, the engine runs the RunAndParseQueryResult method shown in the following code.
class Snippet
{
    private readonly List<string> queryList;   // populated elsewhere with the chunk's MDX queries

    public void RunAndParseQueryResult()
    {
        DataTable result = new DataTable();
        IDbConnection conn = ConnectionFactory.CreateConnection();
        conn.Open();
        foreach (string mdxQuery in queryList)
        {
            ExecuteMdxQuery(conn, result, mdxQuery);
        }
        conn.Close();
        conn.Dispose();
    }

    private void ExecuteMdxQuery(IDbConnection connection, DataTable result, string mdxQuery)
    {
        var conn = connection as AdomdConnection;
        Trace.TraceInformation("Log 1");
        if (conn != null)
        {
            Trace.TraceInformation("Log 2");
            using (AdomdCommand command = new AdomdCommand(mdxQuery, conn) { CommandTimeout = 5000 })
            {
                Trace.TraceInformation("Log 3");
                using (AdomdDataAdapter adapter = new AdomdDataAdapter(command))
                {
                    Trace.TraceInformation("Log 4");
                    try
                    {
                        DataTable dt = new DataTable();
                        Trace.TraceInformation("Log 5");
                        dt.BeginLoadData();
                        Trace.TraceInformation("Log 6");
                        adapter.Fill(dt);
                        Trace.TraceInformation("Log 7");
                        dt.EndLoadData();
                        Trace.TraceInformation("Log 8");
                        if (dt.Rows.Count > 0)
                        {
                            Trace.TraceInformation("Log 9");
                            ParseQueryResult(result, mdxQuery, dt);
                            Trace.TraceInformation("Log 10");
                        }
                    }
                    catch (Exception)
                    {
                        Trace.TraceInformation("Log 11");
                        throw;
                    }
                }
            }
        }
        else
        {
            throw new Exception("Given cube connection can not be cast to AdomdConnection");
        }
    }
}
As you can see, the RunAndParseQueryResult method opens the connection and passes it to the ExecuteMdxQuery method along with the mdxQuery loop variable.
In the ExecuteMdxQuery method, after almost every line I have put a log entry using the Trace.TraceInformation method.
What happens is that at a certain iteration the ExecuteMdxQuery method stops at the
adapter.Fill(dt);
call. I figured this out by looking at the logs: if the call had executed successfully I would have seen "Log 7", and if it had failed I should have seen "Log 11". But neither of those lines appears to run.
When I run the query manually it works fine. The query is definitely not a long-running query, and even if it were, we have specified a timeout of 5000 seconds in
AdomdCommand command = new AdomdCommand(mdxQuery, conn) { CommandTimeout = 5000 }
so it should normally throw a timeout exception. But it does not.
Any opinion on why this could be?
Thank you in advance.
After a lot of effort and a long period of patience, we were able to understand the root cause of this issue with the help of Microsoft support.
In the end, the issue turned out to be related to the behavior of the CommandTimeout property in ADOMD.NET.
The CommandTimeout setting does not work as expected in case of a network failure.
To identify the issue, we had to create a dump of the client application at the moment it hangs and is no longer executing the query.
With the WinDbg tool, we found out that, at the time the hang happens, the application is waiting on methods under the System.Net namespace.
The following is the stack trace we got out of the dump file.
a15e18d548 7ffa07f306fa [InlinedCallFrame: 000000a15e18d548] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
a15e18d548 7ff99b7757af [InlinedCallFrame: 000000a15e18d548] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
a15e18d520 7ff99b7757af DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
a15e18d5f0 7ff99b7761be System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
a15e18d690 7ff99b775d35 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)
a15e18d710 7ff99b775be9 System.Net.FixedSizeReader.ReadPacket(Byte[], Int32, Int32)
a15e18d760 7ff99b782839 System.Net.Security._SslStream.StartFrameHeader(Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
a15e18d7d0 7ff99b78237a System.Net.Security._SslStream.StartReading(Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
a15e18d850 7ff99b781fab System.Net.Security._SslStream.ProcessRead(Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
a15e18d8d0 7ff99b781d6e System.Net.TlsStream.Read(Byte[], Int32, Int32)
a15e18d960 7ff99b781c93 System.Net.PooledStream.Read(Byte[], Int32, Int32)
a15e18d990 7ff99b781499 System.Net.Connection.SyncRead(System.Net.HttpWebRequest, Boolean, Boolean)
a15e18da20 7ff99b76d694 System.Net.ConnectStream.WriteHeaders(Boolean)
a15e18dad0 7ff99b76bb44 System.Net.HttpWebRequest.EndSubmitRequest()
a15e18db20 7ff99b762e58 System.Net.Connection.SubmitRequest(System.Net.HttpWebRequest, Boolean)
a15e18dbb0 7ff99b76120e System.Net.ServicePoint.SubmitRequest(System.Net.HttpWebRequest, System.String)
a15e18dc20 7ff99b60eda4 System.Net.HttpWebRequest.SubmitRequest(System.Net.ServicePoint)
a15e18dc80 7ff99b60ded1 System.Net.HttpWebRequest.GetResponse()
a15e18dd30 7ff99d474f7c Microsoft.AnalysisServices.AdomdClient.HttpStream.GetResponseStream()
a15e18de20 7ff99d474ac0 Microsoft.AnalysisServices.AdomdClient.HttpStream.GetResponseDataType()
a15e18de90 7ff99d4743cc Microsoft.AnalysisServices.AdomdClient.CompressedStream.GetResponseDataType()
a15e18def0 7ff99d470745 Microsoft.AnalysisServices.AdomdClient.XmlaClient.EndRequest()
a15e18df70 7ff99d4702f7 Microsoft.AnalysisServices.AdomdClient.XmlaClient.SendMessage(Boolean, Boolean, Boolean)
a15e18dfe0 7ff99d558f0a Microsoft.AnalysisServices.AdomdClient.XmlaClient.ExecuteStatement(System.String, System.Collections.IDictionary, System.Collections.IDictionary, System.Data.IDataParameterCollection, Boolean)
a15e18e040 7ff99d558037 Microsoft.AnalysisServices.AdomdClient.AdomdConnection+XmlaClientProvider.Microsoft.AnalysisServices.AdomdClient.IExecuteProvider.ExecuteTabular(System.Data.CommandBehavior, Microsoft.AnalysisServices.AdomdClient.ICommandContentProvider, Microsoft.AnalysisServices.AdomdClient.AdomdPropertyCollection, System.Data.IDataParameterCollection)
a15e18e0e0 7ff99d557b28 Microsoft.AnalysisServices.AdomdClient.AdomdCommand.ExecuteReader(System.Data.CommandBehavior)
As you can see from the stack trace, the hang happens in the Receive method of the
System.Net.Sockets.Socket
class.
After this we decided to capture network traces on both the Analysis Services server and the client application side. Interestingly, when we captured traces on both ends, the hang no longer occurred, so we then captured the network traces on the SSAS server only.
The network trace (screenshot not included here) showed that SSAS tried re-transmitting the response 9 times, then finally timed out and closed the TCP connection.
This led us to think that the client loses its network connection for a while. While the client is trying to re-establish the network connection, the result has already been prepared on the SSAS side and SSAS is trying to transfer it to the client; as seen in the network traces, it tried 9 times in my example before giving up. If the client re-establishes the connection during those retries, everything moves forward as expected, but if the client re-establishes the connection after the retries end, the client continues to wait forever even though the server has already attempted to respond. At this point, as a client, we would expect it to at least respect the CommandTimeout property and fail the execution when CommandTimeout is reached, but it does not. This issue has already been submitted to the product team.
As a workaround, we realized that there is another connection string property:
Timeout: Specifies how long (in seconds) the client library waits for a command to complete before generating an error. link
When we set Timeout, then in case of a network failure a connection timeout occurs and the query execution fails instead of the client hanging forever. A dropped connection is not what we would ideally want on a command timeout, but this is what we have until the CommandTimeout property behavior is fixed by the product team.
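For example, a minimal sketch of the workaround, assuming an HTTP(S) connection to SSAS through msmdpump.dll; the server, catalog, and the 600-second value are placeholders, and the Timeout property is the only point of interest:
using System;
using Microsoft.AnalysisServices.AdomdClient;

// ...
var connectionString =
    "Data Source=https://my-ssas-server/olap/msmdpump.dll;" +
    "Catalog=MyCube;" +
    "Timeout=600";   // seconds the client library waits for a command before raising an error

using (var conn = new AdomdConnection(connectionString))
{
    conn.Open();
    // Run the MDX commands as before; if the network drops, the call now fails
    // with a timeout/connection error instead of hanging indefinitely.
}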
Thank you.
Problem Statement
Context
I'm a Software Engineer in Test running order permutations of restaurant menu items to confirm that order placement succeeds with the POS
In short, this POSTs a JSON payload to an endpoint, which then validates the order with the POS and reports success/fail/other
The POS, and therefore the transactions per second (TPS), may vary, but each back end uses the same core handling
This can be as high as ~22,000 permutations per item, each of easily manageable JSON size, that need to be handled as quickly as possible
The network can vary wildly depending on the restaurant and/or region being tested
E.g. some have much higher latency than others
Therefore, the HTTP client should be able to handle the same content and endpoint intelligently regardless of this
Direct Problem
I'm using Apache's HttpClient 5 with PoolingAsyncClientConnectionManager to execute both the GET for the menu contents and the POST to check whether the order succeeds
This works out of the box, but sometimes loses connections with a stream refusal, specifically:
org.apache.hc.core5.http2.H2StreamResetException: Stream refused
No individual tuning I can find works across all network contexts with variable latency
Following the stack trace indicates that the stream had already closed, so I need a way to keep it open or to avoid executing on an already-closed connection:
if (connState == ConnectionHandshake.GRACEFUL_SHUTDOWN) {
    throw new H2StreamResetException(H2Error.PROTOCOL_ERROR, "Stream refused");
}
Some Attempts to Fix the Problem
Tried search engines to find answers, but there are few hits for HttpClient 5
Tried the official documentation, but it is sparse
Tried reducing the max connections per route, shifting the inactivity validation, or changing the connection time-to-live
The inactivity checks may fix the POST, but stall the GET for some transactions
And tuning for one region/restaurant may work for one and then break for another, with only the network as the variable:
PoolingAsyncClientConnectionManagerBuilder builder = PoolingAsyncClientConnectionManagerBuilder
.create()
.setTlsStrategy(getTlsStrategy())
.setMaxConnPerRoute(12)
.setMaxConnTotal(12)
.setValidateAfterInactivity(TimeValue.ofMilliseconds(1000))
.setConnectionTimeToLive(TimeValue.ofMinutes(2))
.build();
Switched to a custom RequestConfig with different timeouts:
private HttpClientContext getHttpClientContext() {
    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(Timeout.of(10, TimeUnit.SECONDS))
            .setResponseTimeout(Timeout.of(10, TimeUnit.SECONDS))
            .build();
    HttpClientContext httpContext = HttpClientContext.create();
    httpContext.setRequestConfig(requestConfig);
    return httpContext;
}
Initial Code Segments for Analysis
(In addition to the above segments with change attempts)
Wrapper handling to initialize the client and get a response:
public SimpleHttpResponse getFullResponse(String url, PoolingAsyncClientConnectionManager manager, SimpleHttpRequest req) {
    try (CloseableHttpAsyncClient httpclient = getHTTPClientInstance(manager)) {
        httpclient.start();
        CountDownLatch latch = new CountDownLatch(1);
        long startTime = System.currentTimeMillis();
        Future<SimpleHttpResponse> future = getHTTPResponse(url, httpclient, latch, startTime, req);
        latch.await();
        return future.get();
    } catch (IOException | InterruptedException | ExecutionException e) {
        e.printStackTrace();
        return new SimpleHttpResponse(999, CommonUtils.getExceptionAsMap(e).toString());
    }
}
With the actual handler and probing code:
private Future<SimpleHttpResponse> getHTTPResponse(String url, CloseableHttpAsyncClient httpclient, CountDownLatch latch, long startTime, SimpleHttpRequest req) {
    return httpclient.execute(req, getHttpContext(), new FutureCallback<SimpleHttpResponse>() {
        @Override
        public void completed(SimpleHttpResponse response) {
            latch.countDown();
            logger.info("[{}][{}ms] - {}", response.getCode(), getTotalTime(startTime), url);
        }

        @Override
        public void failed(Exception e) {
            latch.countDown();
            logger.error("[{}ms] - {} - {}", getTotalTime(startTime), url, e);
        }

        @Override
        public void cancelled() {
            latch.countDown();
            logger.error("[{}ms] - request cancelled for {}", getTotalTime(startTime), url);
        }
    });
}
Direct Question
Is there a way to configure the client so that it can handle these variances on its own, without explicitly modifying the configuration for each endpoint context?
Fixed with a Combination of the Below to Ensure the Connection Is Live/Ready
(Or at least it is stable)
Forcing HTTP 1
HttpAsyncClients.custom()
.setConnectionManager(manager)
.setRetryStrategy(getRetryStrategy())
.setVersionPolicy(HttpVersionPolicy.FORCE_HTTP_1)
.setConnectionManagerShared(true);
Setting Effective Headers for POST
Specifically the close header
req.setHeader("Connection", "close, TE");
Note: the inactivity check helps, but there are still occasional refusals without this
Setting Inactivity Checks by Type
Set POSTs to validate immediately after inactivity
Note: Using 1000 for both caused a high drop rate for some systems
PoolingAsyncClientConnectionManagerBuilder
.create()
.setValidateAfterInactivity(TimeValue.ofMilliseconds(0))
Set GET to validate after 1s
PoolingAsyncClientConnectionManagerBuilder
.create()
.setValidateAfterInactivity(TimeValue.ofMilliseconds(1000))
Given the Error Context
Tracing the connection problem in the stack trace to AbstractH2StreamMultiplexer
shows ConnectionHandshake.GRACEFUL_SHUTDOWN as triggering the stream refusal:
if (connState == ConnectionHandshake.GRACEFUL_SHUTDOWN) {
    throw new H2StreamResetException(H2Error.PROTOCOL_ERROR, "Stream refused");
}
Which corresponds to
connState = streamMap.isEmpty() ? ConnectionHandshake.SHUTDOWN : ConnectionHandshake.GRACEFUL_SHUTDOWN;
Reasoning
If I'm understanding correctly:
The connections were being closed (intentionally or not)
However, they were not being confirmed as ready before executing again
Which caused requests to fail because the stream was no longer viable
Therefore the fix works because (it seems):
Forcing HTTP/1.1 leaves a single connection context to manage
Whereas HttpVersionPolicy NEGOTIATE/FORCE_HTTP_2 had equal or greater failures across the spectrum of regions/menus
And it ensures that all connections are validated before use
And POSTs are always closed due to the close header, which is not available in HTTP/2
Therefore
GET is checked for validity with reasonable periodicity
POST is checked every time, and since it is forcibly closed, it is re-acquired before execution
Which leaves no room for unexpected closures
And it also removes the possibility that the client was incorrectly switching to HTTP/2
Will accept this until a better answer comes along, as this is stable but sub-optimal.
If I'm connected to RabbitMQ and listening for events using an EventingBasicConsumer, how can I tell if I've been disconnected from the server?
I know there is a Shutdown event, but it doesn't fire if I unplug my network cable to simulate a failure.
I've also tried the ModelShutdown event, and CallbackException on the model but none seem to work.
EDIT-----
The one I marked as the answer is correct, but it was only part of the solution for me. There is also HeartBeat functionality built into RabbitMQ. The server specifies it in the configuration file. It defaults to 10 minutes but of course you can change that.
The client can also request a different interval for the heartbeat by setting the RequestedHeartbeat value on the ConnectionFactory instance.
I'm guessing that you're using the C# library? (but even so I think the others have a similar event).
You can do the following:
public class MyRabbitConsumer
{
private IConnection connection;
public void Connect()
{
connection = CreateAndOpenConnection();
connection.ConnectionShutdown += connection_ConnectionShutdown;
}
public IConnection CreateAndOpenConnection() { ... }
private void connection_ConnectionShutdown(IConnection connection, ShutdownEventArgs reason)
{
}
}
This is an example of it, but the marked answer is what led me to this.
var factory = new ConnectionFactory
{
    HostName = "MY_HOST_NAME",
    UserName = "USERNAME",
    Password = "PASSWORD",
    RequestedHeartbeat = 30
};

using (var connection = factory.CreateConnection())
{
    connection.ConnectionShutdown += (o, e) =>
    {
        // handle disconnect
    };

    using (var model = connection.CreateModel())
    {
        model.ExchangeDeclare(EXCHANGE_NAME, "topic");
        var queueName = model.QueueDeclare();
        model.QueueBind(queueName, EXCHANGE_NAME, "#");

        var consumer = new QueueingBasicConsumer(model);
        model.BasicConsume(queueName, true, consumer);

        while (!stop)
        {
            BasicDeliverEventArgs args;
            consumer.Queue.Dequeue(5000, out args);
            if (stop) return;
            if (args == null) continue;
            if (args.Body.Length == 0) continue;

            Task.Factory.StartNew(() =>
            {
                // Do work here on a different thread than this one
            }, TaskCreationOptions.PreferFairness);
        }
    }
}
A few things to note about this.
I'm using # for the topic. This grabs everything. Usually you want to limit by a topic.
I'm setting a variable called "stop" to determine when the process should end. You'll notice the loop runs forever until that variable is true.
The Dequeue waits 5 seconds then leaves without getting data if there is no new message. This is to ensure we listen for that stop variable and actually quit at some point. Change the value to your liking.
When a message comes in, I spawn the handling code on a new thread. The current thread is reserved for just listening to the RabbitMQ messages, and if a handler takes too long to process I don't want it slowing down the other messages. You may or may not need this depending on your implementation. Be careful, however, when writing the code that handles the messages: if it takes a minute to run and you're getting messages at sub-second rates, you will run out of memory or at least run into severe performance issues.
We have a simple WPF application that connects to a service running on the local machine. We use a named pipe for the connection and then register a callback so that the service can later send updates to the client.
The problem is that with each call of the callback we get a build up of memory in the client application.
This is how the client connects to the service.
const string url = "net.pipe://localhost/radal";
_channelFactory = new DuplexChannelFactory<IRadalService>(this, new NetNamedPipeBinding(),url);
and then, on a thread-pool thread, we loop doing the following until we are connected:
var service = _channelFactory.CreateChannel();
service.Register();
service.Register looks like this on the server side
public void Register()
{
    _callback = OperationContext.Current.GetCallbackChannel<IRadalCallback>();
    OperationContext.Current.Channel.Faulted += (sender, args) => Dispose();
    OperationContext.Current.Channel.Closed += (sender, args) => Dispose();
}
This callback is stored and when new data arrives we invoke the following on the server side.
void Sensors_OnSensorReading(object sender, SensorReadingEventArgs e)
{
    _callback.OnReadingReceived(e.SensorId, e.Count);
}
Where the parameters are an int and a double. On the client this is handled as follows.
public void OnReadingReceived(int sensorId, double count)
{
    _events.Publish(new SensorReadingEvent(sensorId, count));
}
But we have found that commenting out _events.Publish... makes no difference to the memory usage. Does anyone see any logical reason why this might be leaking memory? We have used a profiler to track the problem to this point, but cannot find what type of object is building up.
Well, I can partially answer this now. The problem is partially caused by us trying to be clever: opening the connection on another thread and then passing it back to the main GUI thread. The solution was not to use a thread but instead to use a dispatcher timer. It does have the downside that the initial data load is now on the GUI thread, but we are not loading all that much anyway.
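For illustration, a minimal sketch of that approach, reusing the channel factory from the question; the one-second interval and the retry-on-EndpointNotFoundException behavior are assumptions:
using System;
using System.ServiceModel;
using System.Windows.Threading;

// ...
private DispatcherTimer _connectTimer;

private void StartConnecting()
{
    _connectTimer = new DispatcherTimer { Interval = TimeSpan.FromSeconds(1) };
    _connectTimer.Tick += (sender, args) =>
    {
        try
        {
            // Runs on the GUI thread, so the duplex channel is created and used there.
            var service = _channelFactory.CreateChannel();
            service.Register();
            _connectTimer.Stop();
        }
        catch (EndpointNotFoundException)
        {
            // Service is not up yet; try again on the next tick.
        }
    };
    _connectTimer.Start();
}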
However, this was not the entire solution (actually we don't have an entire solution). Once we moved over to a better profiler, we found out that the objects building up were timeout handlers, so we disabled that feature. That's OK for us, as we are always running against localhost, but I can imagine it would be an issue for people working with remote services.
When writing data to a web server, my tests show that HttpWebRequest.ReadWriteTimeout is ignored, contrary to the MSDN spec. For example, if I set ReadWriteTimeout to 1 (= 1 msec) and call myRequestStream.Write() passing in a buffer that takes 10 seconds to transfer, it transfers successfully and never times out on .NET 3.5 SP1. The same test running on Mono 2.6 times out immediately, as expected. What could be wrong?
There appears to be a bug where the write timeout, when set on the Stream instance returned to you by BeginGetRequestStream(), is not propagated down to the native socket. I will be filing a bug to make sure this issue is corrected for a future release of the .NET Framework.
Here is a workaround.
private static void SetRequestStreamWriteTimeout(Stream requestStream, int timeout)
{
    // Work around a framework bug where the request stream write timeout doesn't make it
    // to the socket. The "m_Chunked" field indicates we are performing chunked reads. Since
    // this stream is being used for writes, the value of this field is irrelevant except
    // that setting it to true causes the Eof property on the ConnectStream object to evaluate
    // to false. The code responsible for setting the socket option short-circuits when it
    // sees Eof is true, and does not set the flag. If Eof is false, the write timeout
    // propagates to the native socket correctly.
    if (!s_requestStreamWriteTimeoutWorkaroundFailed)
    {
        try
        {
            Type connectStreamType = requestStream.GetType();
            FieldInfo fieldInfo = connectStreamType.GetField("m_Chunked", BindingFlags.NonPublic | BindingFlags.Instance);
            fieldInfo.SetValue(requestStream, true);
        }
        catch (Exception)
        {
            s_requestStreamWriteTimeoutWorkaroundFailed = true;
        }
    }

    requestStream.WriteTimeout = timeout;
}
private static bool s_requestStreamWriteTimeoutWorkaroundFailed;
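For example, a hypothetical usage of the workaround when writing a request body; the URL, buffer, and timeout values are placeholders:
using System.IO;
using System.Net;

// ...
byte[] buffer = new byte[64 * 1024];   // data to upload
var request = (HttpWebRequest)WebRequest.Create("https://example.com/upload");
request.Method = "PUT";

using (Stream requestStream = request.GetRequestStream())
{
    SetRequestStreamWriteTimeout(requestStream, 1000);   // 1-second write timeout
    requestStream.Write(buffer, 0, buffer.Length);       // now actually subject to the timeout
}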
We're using WCF to build a simple web service which our product uses to upload large files over a WAN link. It's supposed to be a simple HTTP PUT, and it's working fine for the most part.
Here's a simplified version of the service contract:
[ServiceContract, XmlSerializerFormat]
public interface IReplicationWebService
{
    [OperationContract]
    [WebInvoke(Method = "PUT", UriTemplate = "agents/{sourceName}/epoch/{guid}/{number}/{type}")]
    ReplayResult PutEpochFile(string sourceName, string guid, string number, string type, Stream stream);
}
In the implementation of this contract, we read data from the stream and write it out to a file. This works great, so we added some error handling for cases where there's not enough disk space to store the file. Here's roughly what it looks like:
public ReplayResult PutEpochFile(string sourceName, string guid, string number, string type, Stream inStream)
{
    //Stuff snipped
    try
    {
        //Read from the stream and write to the file
    }
    catch (IOException ioe)
    {
        //IOException may mean no disk space
        try
        {
            inStream.Close();
        }
        // if inStream caused the IOException, Close may throw
        catch
        {
        }
        _logger.Debug(ioe.ToString());
        throw new FaultException<IOException>(ioe, new FaultReason(ioe.Message), new FaultCode("IO"));
    }
}
To test this, I'm sending a 100GB file to a server that doesn't have enough space for the file. As expected this throws an exception, but the call to inStream.Close() appeared to hang. I checked into it, and what's actually happening is that the call to Close() made its way through the WCF plumbing until it reached System.ServiceModel.Channels.DrainOnCloseStream.Close(), which according to Reflector allocates a Byte[] buffer and keeps reading from the stream until it's at EOF.
In other words, the Close call is reading the entire 100GB of test data from the stream before returning!
Now it may be that I don't need to call Close() on this stream. If that's the case I'd like an explanation as to why. But more importantly, I'd appreciate it if anyone could explain to me why Close() is behaving this way, why it's not considered a bug, and how to reconfigure WCF so that doesn't happen.
.Close() is intended to be a "safe" and "friendly" way of stopping your operation - and it will indeed complete the currently running requests before shutting down - by design.
If you want to throw down the sledgehammer, use .Abort() on your client proxy (or service host) instead. That just shuts down everything without checking and without being nice about waiting for operations to complete.
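For example, a hedged sketch of that pattern on the client side, assuming a ChannelFactory<IReplicationWebService> named factory and the upload arguments from the question:
using System.ServiceModel;

// ...
IReplicationWebService proxy = factory.CreateChannel();
try
{
    proxy.PutEpochFile(sourceName, guid, number, type, fileStream);
    ((ICommunicationObject)proxy).Close();   // graceful: completes/drains pending work first
}
catch
{
    ((ICommunicationObject)proxy).Abort();   // sledgehammer: tears everything down immediately
    throw;
}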