Is there a way to make the Kafka consumer poll function throw errors, rather than having the library handle them internally? - kotlin

I'm working on a Kafka consumer in Kotlin/Javalin, using the standard Kafka library org.apache.kafka.clients.consumer, and I'm struggling a bit with the poll function. It never seems to throw any errors that can be caught; it just writes warnings/errors to the console. For example, when it can't reach the broker, it logs the warning "Connection to node -1 could not be established. Broker may not be available.":
{
"timestamp": "2022-12-14T13:30:58.673+01:00",
"level": "WARN",
"thread": "main",
"logger": "org.apache.kafka.clients.NetworkClient",
"message": "[Consumer clientId=xxx, groupId=xxx] Connection to node -1 (localhost/127.0.0.1:1000) could not be established. Broker may not be available."
}
But it doesn't actually throw anything, so it's pretty much impossible to handle the error if you want to do anything other than keep polling forever. Does anyone know whether this behavior can be configured? Or am I missing something?
The relevant code:
consumer = createConsumer() // This returns a Consumer<String?, String?>
consumer.subscribe(listOf(TOPIC))
while (true) {
    val records = consumer.poll(Duration.ofSeconds(1))
    records.forEach {
        println(it.key())
    }
    consumer.commitSync() // Commit offsets after processing the entries
}
I can trigger a timeout error if I call the consumer's partitionsFor function, so that could work as a liveness probe, but it feels more like a hack than the intended way to do it.
try {
    val partitions = consumer.partitionsFor(TOPIC)
} catch (e: Exception) {
    println(e)
}
Thanks!

The client is dumb and expects you to provide the correct values.
You can use AdminClient.describeCluster() with the same address to verify the connection, then catch/throw a RuntimeException from that.
Otherwise, the consumer will keep retrying and refreshing the metadata for your bootstrap.servers until it can connect.
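If all you need is the fail-fast behavior the question asks for, a cruder alternative to a full AdminClient is a plain TCP pre-flight check against the bootstrap addresses before calling subscribe(). The sketch below is Java (callable from Kotlin as-is) and uses only the JDK; BrokerProbe and requireReachable are hypothetical names. It is a stand-in for describeCluster(), not an equivalent: an open port only proves that something is listening, not that it is a healthy Kafka broker, and it does not cover a broker that disappears after startup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Hypothetical pre-flight check: throw instead of retrying forever when no bootstrap server answers.
final class BrokerProbe {
    private BrokerProbe() {}

    /** Returns the first "host:port" that accepts a TCP connection within timeoutMillis. */
    static String requireReachable(List<String> bootstrapServers, int timeoutMillis) {
        IOException lastFailure = null;
        for (String server : bootstrapServers) {
            int colon = server.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected host:port but got " + server);
            }
            String host = server.substring(0, colon);
            int port = Integer.parseInt(server.substring(colon + 1));
            try (Socket socket = new Socket()) {
                // Unlike the consumer, connect() gives up after the timeout and throws.
                socket.connect(new InetSocketAddress(host, port), timeoutMillis);
                return server;
            } catch (IOException e) {
                lastFailure = e; // remember why this address failed and try the next one
            }
        }
        throw new IllegalStateException("No bootstrap server reachable: " + bootstrapServers, lastFailure);
    }
}
```

In the consumer from the question, calling something like BrokerProbe.requireReachable(listOf("localhost:1000"), 2000) before consumer.subscribe(...) would throw an IllegalStateException instead of leaving NetworkClient to log the same warning indefinitely.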

Related

RabbitMQPublisher sometimes fails to recover

I am using Vert.x 4.2.1 with the RabbitMQ client, and I just noticed that sometimes, when the RabbitMQ client loses its connection and reconnects, the RabbitMQPublisher is no longer able to publish messages. My call to publisherClient.rxPublish(...) never completes, and it does not throw any error.
My client settings are:
new RabbitMQOptions().setAutomaticRecoveryEnabled(true)
    .setReconnectAttempts(0)
    .setNetworkRecoveryInterval(1000L);
Is there a setting, or anything else, that prevents this situation?
For now, I am trying to resolve the issue with the following workaround:
publisherClient.rxPublish(......)
    .timeout(5, TimeUnit.SECONDS)
    .doOnError(err -> {
        if (err instanceof TimeoutException) {
            LOG.warn("Publisher did not recover, so it will be restarted");
            publisherClient.restart();
        }
    })
    .retry(1L, err -> err instanceof TimeoutException)
A small update on the issue:
It seems to be reproducible: if we try to publish a message while the connection to RabbitMQ is down, we can't publish any message afterwards, even once the connection has recovered and everything seems fine. The call to publisherClient.rxPublish(......) never completes.
Thanks for the help.
How are you testing this? And how are you creating the exchange and queue that you are publishing to?
If the objects are not there when the server comes back then all messages will silently disappear.
In order to create the objects you need to use the connection established callbacks, which are not currently accessible via the Rx wrapper.
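The reason the callbacks matter: if the broker was restarted, non-durable exchanges, queues and bindings are gone, so whatever declared them has to run again after every reconnection, not just once at startup. The sketch below models only that pattern with the JDK; ReconnectingClient, onEachConnect and connectionEstablished are hypothetical names, not the Vert.x API, and the "declarations" are plain strings standing in for real exchange/queue declare calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model: topology declarations registered once and replayed after every (re)connection.
final class ReconnectingClient {
    private final List<Consumer<List<String>>> declarations = new ArrayList<>();
    private final List<String> declaredOnCurrentConnection = new ArrayList<>();
    private int connections;

    // Register a callback that (re)declares exchanges, queues and bindings.
    void onEachConnect(Consumer<List<String>> declare) {
        declarations.add(declare);
    }

    // Called by the (simulated) transport after the first connect and after every recovery.
    void connectionEstablished() {
        connections++;
        declaredOnCurrentConnection.clear(); // assume a restarted broker kept nothing non-durable
        for (Consumer<List<String>> declare : declarations) {
            declare.accept(declaredOnCurrentConnection);
        }
    }

    List<String> declaredOnCurrentConnection() {
        return List.copyOf(declaredOnCurrentConnection);
    }

    int connections() {
        return connections;
    }
}
```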
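Try this (the only difference from the workaround in the question is the trailing .subscribe(); an Rx chain, including its timeout and retry operators, does nothing until something subscribes to it):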
publisherClient.rxPublish(......)
    .timeout(5, TimeUnit.SECONDS)
    .doOnError(err -> {
        if (err instanceof TimeoutException) {
            LOG.warn("Publisher did not recover, so it will be restarted");
            publisherClient.restart();
        }
    })
    .retry(1L, err -> err instanceof TimeoutException)
    .subscribe()
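The same timeout, restart, retry-once logic can be written without Rx using the JDK's CompletableFuture (Java 9+ for orTimeout and failedFuture), which makes it easy to test in isolation. This is a sketch, not the Vert.x API: the attempt supplier stands in for publisherClient.rxPublish(...), restart stands in for publisherClient.restart(), and PublishWithRecovery is a hypothetical name. The supplier is invoked again for the retry, which mirrors how retry re-subscribes to the upstream chain.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical JDK-only analogue of the Rx chain above.
final class PublishWithRecovery {
    private PublishWithRecovery() {}

    /**
     * Runs attempt once with a timeout; on TimeoutException calls restart and tries exactly once more.
     * Any other failure is passed through unchanged.
     */
    static CompletableFuture<Void> publish(Supplier<CompletableFuture<Void>> attempt,
                                           Runnable restart,
                                           long timeoutMillis) {
        return attempt.get()
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .handle((ignored, err) -> {
                if (err == null) {
                    return CompletableFuture.<Void>completedFuture(null);
                }
                Throwable cause = (err instanceof CompletionException && err.getCause() != null)
                        ? err.getCause() : err;
                if (!(cause instanceof TimeoutException)) {
                    return CompletableFuture.<Void>failedFuture(cause);
                }
                restart.run(); // plays the role of publisherClient.restart()
                return attempt.get().orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
            })
            .thenCompose(next -> next); // flatten the nested future from handle()
    }
}
```

A first attempt that never completes (the stuck rxPublish from the question) times out, triggers one restart, and the second attempt decides the result; if that one hangs too, the returned future fails with a TimeoutException.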

Where are published messages kept while RabbitMQ brokers are in a blocked state?

I noticed that when brokers are in a blocked state because of the high watermark, messages are not accepted. However, when the brokers get unblocked, the messages that were sent while they were blocked are accepted after all (even though the publisher is down, so they are not being republished).
Where are these messages kept? Is there a maximum number of messages that can be kept like this, and how do I see how many there are? Is this behavior configurable?
I'm using a CachingConnectionFactory with a publisherConfirm in order to confirm messages are ack'd, but in this case it results in false information. The publisher confirm times out, but the broker eventually processes the message anyway.
The com.rabbitmq.client.impl.AMQChannel has a logic like this:
public void quiescingTransmit(AMQCommand c) throws IOException {
    synchronized (_channelMutex) {
        if (c.getMethod().hasContent()) {
            while (_blockContent) {
                try {
                    _channelMutex.wait();
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
                // This is to catch a situation when the thread wakes up during
                // shutdown. Currently, no command that has content is allowed
                // to send anything in a closing state.
                ensureIsOpen();
            }
        }
        this._trafficListener.write(c);
        c.transmit(this);
    }
}
Pay attention to the while (_blockContent) { loop: the publishing thread is blocked right there, and there are no internal queues that buffer messages until it is unblocked. The thread simply doesn't go anywhere else.
See more info in official RabbitMQ docs: https://www.rabbitmq.com/connection-blocked.html
And also see Spring AMQP docs: https://docs.spring.io/spring-amqp/docs/current/reference/html/#blocked-connections-and-resource-constraints
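To make the "no internal queue" point concrete, here is a deliberately simplified, hypothetical model of the loop above in plain Java: while the blocked flag is set, publish() parks the calling thread on the mutex, and the message exists only in that thread until the flag is cleared. BlockedChannelModel and its methods are illustrative names, not RabbitMQ client API, and unlike quiescingTransmit (which restores the interrupt flag and keeps waiting) the model simply propagates InterruptedException.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the wait loop in quiescingTransmit: no buffer, just a parked thread.
final class BlockedChannelModel {
    private final Object channelMutex = new Object();
    private final List<String> transmitted = new ArrayList<>(); // what actually reached the "socket"
    private boolean blockContent;

    // Plays the role of toggling _blockContent in the snippet above.
    void setBlocked(boolean blocked) {
        synchronized (channelMutex) {
            blockContent = blocked;
            if (!blocked) {
                channelMutex.notifyAll(); // wake every publisher parked in publish()
            }
        }
    }

    void publish(String body) throws InterruptedException {
        synchronized (channelMutex) {
            while (blockContent) {
                channelMutex.wait(); // the caller's thread waits here; the message is held nowhere else
            }
            transmitted.add(body);
        }
    }

    List<String> transmitted() {
        synchronized (channelMutex) {
            return List.copyOf(transmitted);
        }
    }
}
```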

How to tell which amqp message was not routed from basic.return response?

I'm using RabbitMQ with node-amqp lib. I'm publishing messages with mandatory flag set, and when there is no route to any queue, RabbitMQ responds with basic.return as in specification.
My problem is that, as far as I can tell, basic.return is asynchronous and does not contain any information about which message could not be routed (even when the exchange is in confirm mode). How the hell am I supposed to tell which message was returned?
node-amqp emits a 'basic-return' event when it receives basic.return from the broker. The only useful thing in it is the routing key. Since all messages with the same routing key are routed the same way, I assumed that once I get a basic.return for a specific routing key, all messages with that routing key can be considered undelivered:
function deliver(routing_key, message, exchange, resolve, reject) {
    var failed_delivery = function (ret) {
        if (ret.routingKey === routing_key) {
            exchange.removeListener('basic-return', failed_delivery);
            reject(new Error('failed to deliver'));
        }
    };
    exchange.on('basic-return', failed_delivery);
    exchange.publish(
        routing_key,
        message,
        {
            deliveryMode: 1, // non-persistent
            mandatory: true
        },
        function (error_occurred, error) {
            exchange.removeListener('basic-return', failed_delivery);
            if (error_occurred) {
                reject(error);
            } else {
                resolve();
            }
        });
}
I read the AMQP spec because I've used Basic.Return without a problem before, but with the .NET client. I looked through the node-amqp documentation, and I can't even see that it implements Basic.Return.
In any event, the server does respond with the full message when it could not be routed. You may want to consider switching to a different Node.js library; amqplib, for example, does have this feature (exposed as Channel#on('return', function(msg) {...})).
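A more precise alternative to matching on the routing key, once you are on a client that hands you the returned message, is to stamp each publish with a unique messageId property and correlate on that. The pattern is language-agnostic and is sketched here in Java; ReturnTracker is a hypothetical name, and the delivery tags are the publish sequence numbers a client sees in confirm mode. It relies on RabbitMQ's documented ordering in confirm mode: for an unroutable mandatory message, basic.return is sent before the basic.ack.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker that correlates basic.return with the exact message via its messageId property.
final class ReturnTracker {
    private final Map<Long, String> idByDeliveryTag = new ConcurrentHashMap<>();
    private final Map<String, CompletableFuture<Void>> outcomes = new ConcurrentHashMap<>();
    private final Set<String> returnedIds = ConcurrentHashMap.newKeySet();

    // Call just before publishing with mandatory=true and a unique messageId property.
    CompletableFuture<Void> track(long deliveryTag, String messageId) {
        CompletableFuture<Void> outcome = new CompletableFuture<>();
        idByDeliveryTag.put(deliveryTag, messageId);
        outcomes.put(messageId, outcome);
        return outcome;
    }

    // basic.return carries the full message, so its properties identify exactly which publish failed.
    void onReturn(String messageId) {
        returnedIds.add(messageId);
    }

    // The return for an unroutable message arrives before its ack, so the ack settles the outcome.
    // Acks with multiple=true and nacks are left out of this sketch.
    void onAck(long deliveryTag) {
        String messageId = idByDeliveryTag.remove(deliveryTag);
        if (messageId == null) {
            return;
        }
        CompletableFuture<Void> outcome = outcomes.remove(messageId);
        if (returnedIds.remove(messageId)) {
            outcome.completeExceptionally(new IllegalStateException("unroutable: " + messageId));
        } else {
            outcome.complete(null);
        }
    }
}
```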

Redis Timeout Expired message on GetClient call

I hate questions that have "Not Enough Info", so I will try to give detailed information, which in this case means code.
Server:
64-bit build from https://github.com/MSOpenTech/redis/tree/2.6/bin/release
There are three classes:
DbOperationContext.cs: https://gist.github.com/glikoz/7119628
PerRequestLifeTimeManager.cs: https://gist.github.com/glikoz/7119699
RedisRepository.cs https://gist.github.com/glikoz/7119769
We are using Redis with Unity.
In this case we are getting this strange message:
"Redis Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use.";
We checked these:
Is it a configuration issue?
Are we using the wrong RedisServer.exe?
Is there an architectural problem?
Any idea? Any similar story?
Thanks.
Extra Info 1
There is no rejected connection issue in the server stats (I checked via the redis-cli.exe info command).
I have continued to debug this problem, and have fixed numerous things on my platform to avoid this exception. Here is what I have done to solve the issue:
Executive summary:
People encountering this exception should check:
That the PooledRedisClientManager (IRedisClientsManager) is registered in a singleton scope
That the RedisMqServer (IMessageService) is registered in a singleton scope
That any RedisClient returned from either of the above is properly disposed of, so that the pooled clients are not left stale
The solution to my problem:
First of all, this exception is thrown by the PooledRedisClientManager because it has no more pooled connections available.
I'm registering all the required Redis stuff in the StructureMap IoC container (not Unity, as in the author's case). Thanks to this post, I was reminded that the PooledRedisClientManager should be a singleton; I also decided to register the RedisMqServer as a singleton:
ObjectFactory.Configure(x =>
{
    // register the message queue stuff as Singletons in this AppDomain
    x.For<IRedisClientsManager>()
        .Singleton()
        .Use(BuildRedisClientsManager);
    x.For<IMessageService>()
        .Singleton()
        .Use<RedisMqServer>()
        .Ctor<IRedisClientsManager>().Is(i => i.GetInstance<IRedisClientsManager>())
        .Ctor<int>("retryCount").Is(2)
        .Ctor<TimeSpan?>().Is(TimeSpan.FromSeconds(5));
    // Retrieve a new message factory from the singleton IMessageService
    x.For<IMessageFactory>()
        .Use(i => i.GetInstance<IMessageService>().MessageFactory);
});
My BuildRedisClientsManager function looks like this:
private static IRedisClientsManager BuildRedisClientsManager()
{
    var appSettings = new AppSettings();
    var redisClients = appSettings.Get("redis-servers", "redis.local:6379").Split(',');
    var redisFactory = new PooledRedisClientManager(redisClients);
    redisFactory.ConnectTimeout = 5;
    redisFactory.IdleTimeOutSecs = 30;
    redisFactory.PoolTimeout = 3;
    return redisFactory;
}
Then, when producing messages, it's very important that the RedisClient you use is properly disposed of; otherwise, we run into the dreaded "Timeout Expired" (thanks to this post). I have the following helper code to send a message to the queue:
public static void PublishMessage<T>(T msg)
{
    try
    {
        using (var producer = GetMessageProducer())
        {
            producer.Publish<T>(msg);
        }
    }
    catch (Exception ex)
    {
        // TODO: Log or whatever... I'm not throwing to avoid showing users that we have a broken MQ
    }
}
private static IMessageQueueClient GetMessageProducer()
{
    var producer = ObjectFactory.GetInstance<IMessageService>() as RedisMqServer;
    var client = producer.CreateMessageQueueClient();
    return client;
}
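Why a missing Dispose leads to exactly this message can be modelled with a fixed-size pool whose clients are only handed back on close. The sketch is plain Java, not ServiceStack: BoundedPool and Lease are hypothetical names, a Semaphore stands in for the pool's bookkeeping, poolTimeoutMillis plays the role of PoolTimeout, and try-with-resources plays the role of the C# using block.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a fixed-size client pool with a pool timeout.
final class BoundedPool {
    private final Semaphore permits;
    private final long poolTimeoutMillis;

    BoundedPool(int size, long poolTimeoutMillis) {
        this.permits = new Semaphore(size);
        this.poolTimeoutMillis = poolTimeoutMillis;
    }

    // Waits up to the pool timeout for a free client, then fails like the Redis error message.
    Lease getClient() {
        try {
            if (!permits.tryAcquire(poolTimeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Timeout expired: all pooled connections were in use");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled connection", e);
        }
        return new Lease();
    }

    int available() {
        return permits.availablePermits();
    }

    // The client goes back to the pool only in close(), just as only Dispose() returns it in C#.
    final class Lease implements AutoCloseable {
        private boolean closed;

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                permits.release();
            }
        }
    }
}
```

Every lease that is never closed permanently removes one client from the pool, so after enough leaked leases every getClient() call waits out the pool timeout and fails, no matter how idle the Redis server itself is.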
I hope this helps solve your issue too.

What WCF Exceptions should I retry on failure for? (such as the bogus 'xxx host did not receive a reply within 00:01:00')

I have a WCF client that has thrown this common error, only for it to be resolved by retrying the HTTP call to the server. For what it's worth, this exception was not generated after 1 minute; it was generated within 3 seconds.
The request operation sent to xxxxxx did not receive a reply within the configured timeout (00:01:00). The time allotted to this operation may have been a portion of a longer timeout. This may be because the service is still processing the operation or because the service was unable to send a reply message. Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.
How are professionals handling these common WCF errors? What other bogus errors should I handle?
For example, I'm considering timing the WCF call and, if the (bogus) error above is thrown in under 55 seconds, retrying the entire operation (using a while() loop). I believe I have to reset the entire channel, but I'm hoping you can tell me the right way to do it.
I make all of my WCF calls through a custom "using" statement that handles exceptions and potential retries. My code optionally lets me pass a policy object to the statement, so I can easily change the behavior, for example if I don't want to retry on error.
The gist of the code is as follows:
[MethodImpl(MethodImplOptions.NoInlining)]
public static void ProxyUsing<T>(ClientBase<T> proxy, Action action)
    where T : class
{
    try
    {
        proxy.Open();
        using (OperationContextScope context = new OperationContextScope(proxy.InnerChannel))
        {
            // Add some headers here, or whatever you want
            action();
        }
    }
    catch (FaultException fe)
    {
        // Handle stuff here
    }
    finally
    {
        if (proxy != null)
        {
            try
            {
                // A faulted channel cannot be closed gracefully; it has to be aborted
                if (proxy.State != CommunicationState.Faulted)
                {
                    proxy.Close();
                }
                else
                {
                    proxy.Abort();
                }
            }
            catch
            {
                // Close() can itself throw (e.g. CommunicationException, TimeoutException); fall back to Abort()
                proxy.Abort();
            }
        }
    }
}
You can then use the call like follows:
ProxyUsing<IMyService>(myService = GetServiceInstance(), () =>
{
    myService.SomeMethod(...);
});
The NoInlining attribute probably isn't important for you. I need it because I have custom logging code that logs the call stack after an exception, so it's important to preserve the method hierarchy in that case.
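For comparison, here is the same idea (a fresh channel per attempt, Close() on success, Abort() on a faulted channel, and a policy object deciding which failures are retried) sketched in Java with no WCF dependency. Channel, RetryPolicy and ProxyUsing are hypothetical stand-ins for ClientBase, the answer's policy object and its helper; which exception types belong in the retry set, for example the spurious timeout from the question, is a judgment call the sketch leaves to the caller.

```java
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a WCF-style client channel (ClientBase in the C# code above).
interface Channel {
    boolean isFaulted();
    void close();
    void abort();
}

// Hypothetical policy object: how many attempts to make and which failures are worth retrying.
final class RetryPolicy {
    private final int maxAttempts;
    private final Set<Class<? extends RuntimeException>> retryOn;

    RetryPolicy(int maxAttempts, Set<Class<? extends RuntimeException>> retryOn) {
        this.maxAttempts = maxAttempts;
        this.retryOn = Set.copyOf(retryOn);
    }

    boolean shouldRetry(RuntimeException failure, int attempt) {
        if (attempt >= maxAttempts) {
            return false;
        }
        for (Class<? extends RuntimeException> type : retryOn) {
            if (type.isInstance(failure)) {
                return true;
            }
        }
        return false;
    }
}

final class ProxyUsing {
    private ProxyUsing() {}

    static <C extends Channel, R> R call(Supplier<C> newChannel, Function<C, R> action, RetryPolicy policy) {
        for (int attempt = 1; ; attempt++) {
            C channel = newChannel.get(); // fresh channel per attempt: a faulted one cannot be reused
            try {
                R result = action.apply(channel);
                closeOrAbort(channel);
                return result;
            } catch (RuntimeException failure) {
                closeOrAbort(channel);
                if (!policy.shouldRetry(failure, attempt)) {
                    throw failure;
                }
            }
        }
    }

    // Same rule as the C# finally block: Close() a healthy channel, Abort() a faulted one,
    // and fall back to Abort() if Close() itself throws.
    private static void closeOrAbort(Channel channel) {
        try {
            if (channel.isFaulted()) {
                channel.abort();
            } else {
                channel.close();
            }
        } catch (RuntimeException closeFailure) {
            channel.abort();
        }
    }
}
```

With new RetryPolicy(3, Set.of(IllegalStateException.class)), only that exception type is retried, for at most three attempts in total; anything else propagates on the first failure.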