Recover connection when a new master is elected in the cluster - redis

I have a Redis cluster with 3 nodes: one master and two slaves holding replicas of the master. When I kill the master instance, Redis Sentinel promotes another node to master, and it starts accepting writes.
During my tests I noticed that once the new master is promoted, the first operation issued through SE.Redis (StackExchange.Redis) fails with:
StackExchange.Redis.RedisConnectionException: SocketFailure on GET
---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
To work around it, I've implemented retry logic as shown below. Is there a better alternative?
private RedisValue RedisGet(string key)
{
    return RedisOperation(() =>
    {
        RedisKey redisKey = key;
        RedisValue redisValue = connection.StringGet(redisKey);
        return redisValue;
    });
}

private T RedisOperation<T>(Func<T> act)
{
    int timeToSleepBeforeRetryInMilliseconds = 20;
    DateTime startTime = DateTime.Now;

    while (true)
    {
        try
        {
            return act();
        }
        catch (Exception e)
        {
            Debug.WriteLine("Failed to perform REDIS OP: " + e.Message);
            TimeSpan passedTime = DateTime.Now - startTime;
            if (this.retryTimeout < passedTime)
            {
                Debug.WriteLine("ABORTING re-try to REDIS OP");
                throw;
            }

            int remainingTimeout = (int)(this.retryTimeout.TotalMilliseconds - passedTime.TotalMilliseconds);
            // If the remaining timeout is shorter than the sleep interval,
            // sleep only for the remaining time and then make one last attempt.
            if (remainingTimeout < timeToSleepBeforeRetryInMilliseconds)
            {
                timeToSleepBeforeRetryInMilliseconds = remainingTimeout;
            }

            Debug.WriteLine("Sleeping " + timeToSleepBeforeRetryInMilliseconds + " ms before next try");
            System.Threading.Thread.Sleep(timeToSleepBeforeRetryInMilliseconds);
        }
    }
}

TLDR: don't use Sentinel with StackExchange.Redis, as Sentinel support is still not implemented in this client library.
See https://github.com/StackExchange/StackExchange.Redis/labels/sentinel for all the open issues; there is also a pretty good PR that has been open for about a year now.
That being said, I also had a relatively good experience with retries, but I would never use that approach in production, as it is not reliable at all.
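If you do stay with a retry wrapper while Sentinel support is missing, one small improvement is to retry only on connection-related failures rather than every exception, so genuine application errors surface immediately. A minimal sketch, assuming the same `connection` and `retryTimeout` members as in the question:

private T RedisOperation<T>(Func<T> act)
{
    DateTime startTime = DateTime.Now;
    while (true)
    {
        try
        {
            return act();
        }
        // Sketch only: retry solely on connection/timeout failures so the retry loop
        // doesn't silently swallow unrelated exceptions.
        catch (Exception e) when (e is RedisConnectionException || e is RedisTimeoutException)
        {
            if (DateTime.Now - startTime > this.retryTimeout)
            {
                throw; // give up after the configured retry window
            }
            System.Threading.Thread.Sleep(20);
        }
    }
}

The ConnectionMultiplexer also exposes ConnectionFailed and ConnectionRestored events, which are handy for logging exactly when the failover is detected and when the client has reconnected.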

Related

Cosmos DB - ChangeFeedWorker not working properly after Cosmos DB reached 50 GB and split into 2 partitions

I just want to know whether anyone has already run into issues with a change feed worker after a Cosmos DB container reached the 50 GB size limit for one partition and was automatically split into 2 partitions.
Since the split we have observed that, after adding many new items to the database, the change feed workers are not always triggered, and the view that should receive the changes is only partially updated.
In the lease container we can now see 2 "LeaseToken" values (1 and 2); it was 0 before.
Does anyone have an idea where to look? It was working fine before the split.
Here is how I start my worker:
async Task IChangeFeedWorker.StartAsync()
{
    await _semaphoreSlim.WaitAsync();
    string operationName = $"{nameof(CosmosChangeFeedWorker)}.{nameof(IChangeFeedWorker.StartAsync)}";
    using var _ = BeginChangeFeedWorkerScope(_logger, operationName, _processorName, _instanceName);
    try
    {
        if (Active)
        {
            return;
        }

        Container eventContainer = _cosmosClient.GetContainer(_databaseId, _eventContainerId);
        Container leaseContainer = _cosmosClient.GetContainer(_databaseId, _leaseContainerId);

        _changeFeedProcessor = eventContainer
            .GetChangeFeedProcessorBuilder<CosmosEventChange>(_processorName, HandleChangesAsync)
            .WithInstanceName(_instanceName)
            .WithLeaseContainer(leaseContainer)
            .WithStartTime(_startTimeOfTrackingChanges)
            .Build();

        await _changeFeedProcessor.StartAsync();
        Active = true;

        _logger.LogInformation(
            "Change feed processor {ProcessorName} instance {InstanceName} has been started.",
            _processorName, _instanceName);
    }
    catch (Exception e)
    {
        _logger.LogError(e,
            "Starting of change feed processor {ProcessorName} instance {InstanceName} has failed.",
            _processorName, _instanceName);
        _changeFeedProcessor = null;
        throw;
    }
    finally
    {
        _semaphoreSlim.Release();
    }
}
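One thing that may help diagnose this is to make the processor's lease handling visible. Recent versions of the Microsoft.Azure.Cosmos SDK let you attach notification delegates on the builder; the sketch below (an assumption about your SDK version, reusing the field names from the code above) logs lease acquisition, release, and per-lease errors so you can see whether both LeaseTokens are actually being acquired and processed:

// Sketch: log lease acquisition/release and per-lease errors so it is visible
// whether both LeaseTokens (1 and 2) are being processed after the split.
_changeFeedProcessor = eventContainer
    .GetChangeFeedProcessorBuilder<CosmosEventChange>(_processorName, HandleChangesAsync)
    .WithInstanceName(_instanceName)
    .WithLeaseContainer(leaseContainer)
    .WithStartTime(_startTimeOfTrackingChanges)
    .WithLeaseAcquireNotification(leaseToken =>
    {
        _logger.LogInformation("Lease {LeaseToken} acquired.", leaseToken);
        return Task.CompletedTask;
    })
    .WithLeaseReleaseNotification(leaseToken =>
    {
        _logger.LogInformation("Lease {LeaseToken} released.", leaseToken);
        return Task.CompletedTask;
    })
    .WithErrorNotification((leaseToken, exception) =>
    {
        _logger.LogError(exception, "Error on lease {LeaseToken}.", leaseToken);
        return Task.CompletedTask;
    })
    .Build();

If one of the two leases is never acquired, or errors repeatedly, that narrows down where to look.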

Akka.net: Should I specify "split brain resolver" configuration for Lighthouse/Seed nodes

I have an application that uses the Akka.NET cluster feature. The people who wrote the code have left the company.
I am trying to understand the code, and we are planning a deployment.
The cluster has 2 types of nodes:
QueueServicer: supports sharding; only these nodes should participate in sharding.
Lighthouse: just seed nodes, nothing else.
Lighthouse: 2 nodes
QueueServicer: 3 nodes
I see that one of the QueueServicer nodes is unable to join the cluster; both Lighthouse nodes refuse the connection. It constantly tries to join and never succeeds. This has been happening for the last 5 days or so, and the node never dies either. Its CPU and memory usage is high, it has no queue processor actors running (based on a filtered search through the log), and garbage collection takes a long time. In the log for this node I see the following:
{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp#lighthouse-1:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp#lighthouse-1:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp#lighthouse-1:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp#lighthouse-1:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp#lighthouse-0:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp#lighthouse-0:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp#lighthouse-0:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp#lighthouse-0:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
There are other "Now supervising", "Stopping", and "Started" logs, which I am omitting here.
Can you please verify whether the HOCON config is correct for the split brain resolver and sharding?
I think the Lighthouse/seed nodes should not have the sharding configuration specified; I think that is a mistake.
I also think the split brain resolver configuration in the Lighthouse/seed nodes might be wrong and should not be specified for seed nodes.
I appreciate your help.
Here is the HOCON for QueueServicer (trimmed):
akka {
    loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
    log-config-on-start = on
    loglevel = "DEBUG"
    actor {
        provider = cluster
        serializers {
            hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
        }
        serialization-bindings {
            "System.Object" = hyperion
        }
    }
    remote {
        dot-netty.tcp {
            ….
        }
    }
    cluster {
        seed-nodes = ["akka.tcp://myapp#lighthouse-0:7892", "akka.tcp://myapp#lighthouse-1:7892"]
        roles = ["QueueProcessor"]
        sharding {
            role = "QueueProcessor"
            state-store-mode = ddata
            remember-entities = true
            passivate-idle-entity-after = off
        }
        downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
        split-brain-resolver {
            active-strategy = keep-majority
            stable-after = 20s
            keep-majority {
                role = "QueueProcessor"
            }
        }
        down-removal-margin = 20s
    }
    extensions = ["Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider,Akka.Cluster.Tools"]
}
Here is the HOCON for Lighthouse:
akka {
    loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
    log-config-on-start = on
    loglevel = "DEBUG"
    actor {
        provider = cluster
        serializers {
            hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
        }
        serialization-bindings {
            "System.Object" = hyperion
        }
    }
    remote {
        dot-netty.tcp {
            …
        }
    }
    cluster {
        seed-nodes = ["akka.tcp://myapp#lighthouse-0:7892", "akka.tcp://myapp#lighthouse-1:7892"]
        roles = ["lighthouse"]
        sharding {
            role = "lighthouse"
            state-store-mode = ddata
            remember-entities = true
            passivate-idle-entity-after = off
        }
        downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
        split-brain-resolver {
            active-strategy = keep-oldest
            stable-after = 30s
            keep-oldest {
                down-if-alone = on
                role = "lighthouse"
            }
        }
    }
}
I meant to reply to this sooner.
Here is your problem: you're using two different split brain resolver configurations - one for the QueueServicer and one for Lighthouse. Therefore, how your cluster resolves itself is going to be quite different depending upon who is the leader of each half of the cluster.
I would stick with a simple keep-majority strategy and use it uniformly on all nodes throughout the cluster - we're very likely going to enable this by default in Akka.NET v1.5.
If you have any questions, please feel free to reach out to us: https://petabridge.com/
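To make that concrete, here is a minimal sketch (not from the original answer) of applying the same keep-majority settings uniformly when the configuration is loaded in code; the system name, file name, and timing values are assumptions and should be adapted to the cluster above:

using System.IO;
using Akka.Actor;
using Akka.Configuration;

// Sketch: the same keep-majority split brain resolver settings applied to every
// node type (QueueServicer and Lighthouse alike), merged on top of whatever
// configuration the node already loads. Values shown here are illustrative.
var sbrConfig = ConfigurationFactory.ParseString(@"
    akka.cluster {
        downing-provider-class = ""Akka.Cluster.SplitBrainResolver, Akka.Cluster""
        split-brain-resolver {
            active-strategy = keep-majority
            stable-after = 20s
        }
    }");

// nodeConfig stands in for the HOCON each node already loads (trimmed in the question).
Config nodeConfig = ConfigurationFactory.ParseString(File.ReadAllText("akka.hocon"));
var system = ActorSystem.Create("myapp", sbrConfig.WithFallback(nodeConfig));

The important part is that the split-brain-resolver block is identical on the QueueServicer and Lighthouse nodes, so both sides of a partition apply the same downing decision.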

CloudSQL persistent connections error

I'm using Google Cloud SQL second-generation instances with persistent connections through the PHP PDO driver, but from time to time, and only on some App Engine instances, the pooled connections get corrupted, connections begin to fail, and an ugly error message is shown to the user.
I tried to solve this by making new connections, even disabling persistence, but it didn't work:
for ($attempt = 1; !$this->link; $attempt++) {
    try {
        if ($attempt > $persistent / 2) {
            unset($options[PDO::ATTR_PERSISTENT]);
        }
        $this->link = new PDO($dsn_string, $user, $pass, $options);
    } catch (PDOException $err) {
        if ($attempt <= $persistent) {
            usleep($attempt * 100000);
        } else {
            throw new DB_Exception("Error connecting database (after $attempt attempts):\n" . $err->getMessage(), $err->getCode(), null, $err);
        }
    }
}

How to delete a null queue (i.e. one without a queue name) in ActiveMQ 5.8.0?

How can I delete a null queue (i.e. one without a queue name) in ActiveMQ 5.8.0?
I have a problem with a null queue: when I try to delete it using the Delete button in the ActiveMQ 5.8.0 console, it throws the error shown below.
Error!
Exception occurred while processing this request, check the log for
more information!
What do you want to do next?
There were some issues around the creation of Queues with blank names in earlier releases. I'm not sure you will be able to delete the Queue without simply deleting all the KahaDB files and starting over.
One thing to try would be to use JConsole to connect to the broker and invoke the remove operation on the Queue MBean.
We ran into this issue recently and I logged this ticket on it: https://issues.apache.org/jira/browse/AMQ-5211
We are using mkahadb (configured to use a separate directory per destination) and were able to simply delete the corresponding directory (/data//kaha/queue#3a#2f#2f) and restart AMQ.
Otherwise, try JMX, or you'll need to wipe out the entire message store and start over, as Tim suggested.
Here's some code I wrote a while back to deal with a similar problem. It connects using JMX and removes any empty queues that have never been used.
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

import org.apache.activemq.broker.jmx.BrokerViewMBean;
import org.apache.activemq.broker.jmx.QueueViewMBean;

public class CleanQueues {

    public static void main(String[] args) throws Exception {
        if (args.length != 1 && args.length != 2) {
            System.out.println("Usage: CleanQueues host [port]");
            System.exit(1);
        }
        String host = args[0];
        String port = "1099";
        if (args.length == 2) {
            port = args[1];
        }

        // Connect to the broker's JMX endpoint.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
        MBeanServerConnection connection = jmxc.getMBeanServerConnection();

        // Locate the broker MBean.
        ObjectName broker = null;
        for (ObjectName objectName : connection.queryNames(new ObjectName("org.apache.activemq:BrokerName=*,Type=Broker"), null)) {
            broker = objectName;
        }
        if (broker == null) {
            System.out.println("Could not find broker name.");
            System.exit(2);
        }

        // Remove every queue that has never dispatched a message and has no consumers.
        BrokerViewMBean proxy = JMX.newMBeanProxy(connection, broker, BrokerViewMBean.class);
        for (ObjectName n : proxy.getQueues()) {
            QueueViewMBean q = JMX.newMBeanProxy(connection, n, QueueViewMBean.class);
            if (q.getDispatchCount() == 0 && q.getConsumerCount() == 0) {
                System.out.println("Removing queue: " + q.getName());
                proxy.removeQueue(q.getName());
            }
        }
    }
}

How to resend from Dead Letter Queue using Redis MQ?

Just spent my first few hours looking at Redis and Redis MQ.
I'm slowly getting the hang of Redis, and I was wondering how you can resend a message that is in a dead letter queue.
Also, where are the configuration options that determine how many times a message is retried before it goes into the dead letter queue?
Currently, there's no way to automatically resend messages from the dead letter queue in ServiceStack. However, you can do it manually relatively easily.
You can reload messages from the dead letter queue with something like this:
public class AppHost : AppHostBase
{
    public override void Configure(Container container)
    {
        // create the hostMq ...
        // with retryCount = 2, 3 total attempts are made: the 1st + 2 retries
        var hostMq = new RedisMqHost(clients, retryCount: 2);

        // before starting hostMq, requeue anything sitting in the DLQ
        RecoverDLQMessages<TheMessage>(hostMq);

        // add handlers
        hostMq.RegisterHandler<TheMessage>(m =>
            this.ServiceController.ExecuteMessage(m));

        // start hostMq
        hostMq.Start();
    }
}
Which ultimately uses the following to recover (requeue) messages:
private void RecoverDLQMessages<T>(RedisMqHost hostMq)
{
    var client = hostMq.CreateMessageQueueClient();
    var errorQueue = QueueNames<T>.Dlq;

    log.InfoFormat("Recovering Dead Messages from: {0}", errorQueue);

    var recovered = 0;
    byte[] msgBytes;
    while ((msgBytes = client.Get(errorQueue, TimeSpan.FromSeconds(1))) != null)
    {
        var msg = msgBytes.ToMessage<T>();
        msg.RetryAttempts = 0;   // reset so the message gets a fresh set of retries
        client.Publish(msg);     // publish back onto the normal input queue
        recovered++;
    }

    log.InfoFormat("Recovered {0} from {1}", recovered, errorQueue);
}
Note
At the time of this writing, there's a possibility of ServiceStack losing messages (see Issue 229), so don't kill the process while it's moving messages from the DLQ (dead letter queue) back to the input queue. Under the hood, ServiceStack is POPing messages from Redis.
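On the second part of the question: as far as I know there is no standalone configuration setting for the retry limit; it is the retryCount argument passed when constructing the MQ host, as in the code above. A minimal sketch (the clients variable and handler are the same assumptions as above):

// retryCount: 1 => one retry after the initial attempt (2 attempts total)
// before the message is moved to QueueNames<TheMessage>.Dlq.
var hostMq = new RedisMqHost(clients, retryCount: 1);
hostMq.RegisterHandler<TheMessage>(m => this.ServiceController.ExecuteMessage(m));
hostMq.Start();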