Predis - Removing server from the connection pool

Say I have N servers in the Predis connection pool. I found that when one of the servers goes down, Predis stops working (i.e. new Predis\Client(s1, s2, ...) does not return successfully if any server Si is down). The entry for the failed server first has to be removed manually, and only then does Predis resume working.
Since Predis claims to use consistent hashing, shouldn't it automatically detect which server is not responding (and has failed), and distribute the keys stored on the failed server across the other working servers?

Predis does use consistent hashing, but it is left to you to ensure that all servers in the pool are up and responding; monitoring server availability is not intrinsically implied by consistent hashing.
You can check each server before you attempt to connect and modify your connection pool based on those checks. You can store the available server list for the pool elsewhere and have some other process constantly watching and updating it. You could simply assume the servers are always all up and only check which ones need to be removed when a failure occurs, or you can use any combination of the above. The bottom line is that Predis does not, at the moment, do it for you.
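To illustrate the first option, here is a minimal sketch of probing each server before building the pool. It assumes Predis 1.x; the host addresses and the 0.5-second timeout are hypothetical, so adjust both to your environment.

    <?php
    require 'vendor/autoload.php';

    // Hypothetical candidate servers; the ?timeout= query parameter keeps
    // a dead host from blocking the probe for long.
    $candidates = [
        'tcp://10.0.0.1:6379?timeout=0.5',
        'tcp://10.0.0.2:6379?timeout=0.5',
        'tcp://10.0.0.3:6379?timeout=0.5',
    ];

    $available = [];
    foreach ($candidates as $server) {
        try {
            // Probe each server individually before admitting it to the pool.
            // ping() forces the lazy connection to open.
            $probe = new Predis\Client($server);
            $probe->ping();
            $available[] = $server;
        } catch (Predis\Connection\ConnectionException $e) {
            // Server is down or unreachable: leave it out of the pool.
        }
    }

    // Build the clustered client only from the servers that answered.
    // (Handle the case where $available is empty before doing this.)
    $client = new Predis\Client($available);

Keep in mind that dropping a server from the pool changes the hash ring: keys that hashed to the failed server will now map to other nodes, and whatever was stored only on the failed node stays unavailable until it comes back.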

Related

Apache Ignite Force Server Mode

We are trying to prevent our application startups from just spinning if we cannot reach the remote cluster. From what I've read, force server mode states:
In this case, discovery will happen as if all the nodes in topology
were server nodes.
What I want to know is:
Does this client then permanently act as a server, which would run computes and store caching data?
If the connection to the cluster does not happen at first, can a later connection to an established cluster cause issues with consistency? What would be the expected behavior with a topology version mismatch? Is there potential for a split-brain scenario?
No, it's still a client node, but it behaves as a server at the discovery protocol level. For example, it can start without any server nodes running.
A client node can never cause data inconsistency, as it never stores data. This does not depend on the forceServerMode flag.
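For context, forceServerMode is a property of the TCP discovery SPI. A minimal Spring XML sketch of a client node with the flag set (the bean and property names are standard Ignite configuration; the layout is illustrative):

    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <!-- The node joins as a client: no primary data is stored on it. -->
        <property name="clientMode" value="true"/>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <!-- Let the client start and discover as if it were a server,
                     so startup does not block waiting for server nodes. -->
                <property name="forceServerMode" value="true"/>
            </bean>
        </property>
    </bean>

The same two properties can also be set programmatically on IgniteConfiguration and TcpDiscoverySpi.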

Is it required to start the MSDTC service on the database server along with the web server? Also, should it be running on the mirroring server too?

My project supports nested transactions, and thus we have the MSDTC service running on the web server as well as on the database server. The project is working fine. However, we have database mirroring established on the database server, so whenever failover happens, site pages where nested transactions are used throw an error:
The operation is not valid for the state of the transaction.
We have the MSDTC service running on the mirroring database server too. Please suggest what should be done to overcome this problem.
In the default DTC setup, it is the DTC of the server that initiates the transactions (the web server in your case) that coordinates them. When the first database server goes down, it rolls back its current transaction and notifies the transaction coordinator, and that is why you get the error: the web server cannot commit the transaction because at least one participant has voted for a rollback.
I don't think you can get around that. What your web server should do is retry the complete transaction. Database calls would then be handled by the mirror server and would succeed.
That is at least my opinion. I'm no authority on distributed transactions, nor on database clusters with automatic failover...

BizTalk connectivity issue to SQL during VM snapshot

We have one VM for BizTalk and a separate VM for the SQL backend. We are using Veeam for backups, which basically kicks off a snapshot of the VM. When this snapshot is being finalized on the SQL VM, the BizTalk services on the application server fail. Usually they restart automatically, but sometimes manual intervention is required to start the services. The error below is logged on the BizTalk server.
Are there any timeout settings or config changes that will allow BizTalk services to stay up during the snapshot process?
An error occurred that requires the BizTalk service to terminate. The most common causes are the following:
1) An unexpected out of memory error.
OR
2) An inability to connect or a loss of connectivity to one of the BizTalk databases.
The service will shutdown and auto-restart in 1 minute. If the problematic database remains unavailable, this cycle will repeat.
Error message: [DBNETLIB][ConnectionRead (recv()).]General network error. Check your network documentation.
Error source:
BizTalk host name: BizTalkServerApplication
Windows service name: BTSSvc$BizTalkServerApplication
We experienced the same situation and error with both BizTalk 2009 and BizTalk 2013, each set up with two App servers and one SQL DB server.
When our VMware does the final step of the snapshot backup on the application servers, it freezes the application server for about 10 seconds, preventing it from receiving packets. SQL Server 2008 and 2012 will, by default, send keep-alive packets to clients every 30 seconds (30,000 ms). If the SQL Server fails to receive a response back from the App server, it sends out 5 retries (the default setting) of the keep-alive request 1 second (1,000 ms) apart. If SQL still does not receive a response, it terminates the connection, which causes the BizTalk hosts on the App server to reset; in our case, when our German-made ERP system sent its EDI documents over to BizTalk during that reset period, the transmission failed.
We trapped the issue by running NetMon on the DB and App servers and waiting for the next error message. Upon inspection, we saw the five SQL keep-alive packets being sent to the App servers 1 second apart, while at the same time NO packets at all were received on the application server. At first guess one might think they were "just dropped network packets", which is rarely the case. We then made the correlation to the timing of the VM snapshots, and can now confirm that each day, when the snapshot finishes, the App servers freeze.
As a short-to-mid-term workaround, we raised the number of retries SQL attempts before declaring a connection dead (5 by default) by adding the registry value TcpMaxDataRetransmissions and setting it to 30 (thus 30 seconds before SQL declares the client unresponsive). This has masked the problem for us for now; use at your own discretion.
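For reference, that value lives under the TCP/IP parameters key. A sketch of the change (the path and value name are the documented ones, but verify them against your Windows version, and note that Tcpip parameter changes typically require a reboot to take effect):

    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpMaxDataRetransmissions /t REG_DWORD /d 30 /f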
We are also looking at an Agent-based version of the VM Snapshot, which may alleviate the condition of freezing the server.
Are there any timeout settings or config changes that will allow BizTalk services to stay up during the snapshot process?
Not that I am aware of, however you might want to Google config options in the btsntsvc.exe.config file, which is located in your BizTalk installation directory.
All messages that pass through BizTalk are written to the BizTalkMsgBoxDb, and its other databases are involved if you are running tracking, BAM, etc. The only service that can cache 'stuff' and handle a database outage is the Enterprise Single Sign-On (ESSO) service. BizTalk therefore needs a persistent connection to the database server to remain 'up', which is why your host instance (BizTalkServerApplication) is stopping - it simply wouldn't be able to process messages if the database wasn't there.
I would add that your approach to backups probably isn't supported by Microsoft, and I would further suggest that you seriously consider whether an approach that takes your database server offline during the backup is viable.
BizTalk has a pretty robust backup solution for its various databases built into the product, and I would recommend that you take a look at using this supported method.
If you do need to take snapshots of the database system - say once a night - you might want to consider stopping the BizTalk Host Instances, performing the snapshot, and then re-starting the Host Instances through some scripted task.
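A minimal sketch of such a scripted task, reusing the Windows service name from the error text above (one net stop/start pair per host instance; the snapshot step is whatever your backup tool invokes, shown here only as a placeholder):

    net stop BTSSvc$BizTalkServerApplication
    rem ... trigger the VM snapshot here and wait for it to finish ...
    net start BTSSvc$BizTalkServerApplication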
You might also want to consider checking whether there are any hotfixes for your version of BizTalk Server included in a Cumulative Update that might help address your problem.

How to troubleshoot issues caused by clustering or load balancing?

Hi, I have an application that is deployed on two WebLogic app servers.
Recently we have an issue where, in certain cases, the user session returned is null. Developer feedback is that it could be caused by the session not being replicated to the other server.
How do we prove if this is really the case?
Are you using a single session store that both application servers can access via some communication protocol? If not, then that is definitely the case. Think about it: if your WebLogic servers are storing the session in memory, with users passing their session id via cookies, then one server has no way of accessing the memory on the other machine. Unless you are using sticky load balancing. Are you?
There are two concepts to consider here: session stickiness and session replication.
Session stickiness is a mechanism whereby WebLogic Server ensures that if a request from a user with session A goes to server 1, then the next request from the user with session A will go to server 1 only.
This is achieved by configuring a hardware load balancer (like an F5) that is capable of providing session stickiness, or by configuring the WebLogic proxy plug-in installed on Apache/IIS/WebLogic.
The first time a request reaches a WLS managed server, it responds with a session id and appends to it the JVM id of the server (this is the primary id); if the managed server is part of a cluster, it also attaches a secondary server JVM id (the secondary server is the server where the session is replicated).
The proxy maintains a table of all JVM ids and the corresponding IPs of the managed servers; it also checks periodically whether the servers are up and running.
The next time a request passes the proxy with an existing session id and a primary JVM id, the proxy parses this and tries to send the request to that server; if it cannot within some time, it tries to send it to the secondary server.
Session replication - this takes effect when you configure a WLS cluster with 2 or more managed servers and enable replication in the web application's session descriptor (see the sketch below). Each time any data in a session is updated, that data is replicated to the secondary server too.
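A minimal weblogic.xml session descriptor for in-memory replication might look like this (the element and value are standard WebLogic descriptors, but check the documentation for your WLS version):

    <!-- weblogic.xml: replicate HTTP sessions in memory when clustered -->
    <session-descriptor>
        <persistent-store-type>replicated_if_clustered</persistent-store-type>
    </session-descriptor>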
So in your case, if your application users are losing sessions or getting redirected to the login page during normal usage, check that the session did not get invalidated because of a timeout; if you have defined a cluster and are using the WLS proxy, check the proxy debug output to make sure the primary and secondary server ids are being appended to the session id.
Finally, there's a simple example in the sample application deployment of WLS that you can use to test session replication and failover functionality.
So, to prove why the session is getting lost:
1) check the server log to see if the session got invalidated because of a timeout;
2) if using wlproxy, enable debug (see the sketch after this list), and the next time the issue happens, check in the proxy log whether the request was sent to a different server, and whether that server was not the secondary server.
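If the proxy is the Apache plug-in (mod_wl), debug output can be switched on in its configuration. A hedged sketch, with a hypothetical application path, cluster addresses, and log location (Debug and WLLogFile are standard plug-in parameters):

    # httpd.conf: route /myapp through the WebLogic proxy plug-in
    <Location /myapp>
        SetHandler weblogic-handler
        WebLogicCluster server1:7001,server2:7001
        Debug ALL
        WLLogFile /tmp/wlproxy.log
    </Location>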

Connection Problems With Mirrored Database

The setup is as follows:
A C++ client connects via OLE DB/SQL Native Client to a SQL Server 2005 database located on another machine. The server is set up for mirroring (automatic failover) with a synchronized mirror located on yet another server and a witness server on a third.
Occasionally (once every couple of days), our application seizes up: it appears to attempt to establish a database connection, and rather than simply failing with OLEDB throwing a database connection failure, it just gets "stuck" (we have a timeout for the connection, but it never times out). 24 to 36 hours later we'll get an error:
TCP Provider: An existing connection was forcibly closed by the remote host.
And things will continue on with lots of these errors, and our app will eventually need to be restarted. We can't really figure out what condition could be causing this behavior or what we can do about it.
In preliminary research, I've seen some related problems that were solved by setting the Connection Lifetime connection string property to something non-zero.
Does anyone have any thoughts on what might be going on here?
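For anyone trying the Connection Lifetime suggestion above, a hedged sketch of what it looks like in an ADO.NET-style connection string (the server and database names are hypothetical; Connection Lifetime is a pooling keyword of the .NET SqlClient, so for an OLE DB/SQL Native Client connection check that provider's own keyword set; Failover Partner names the mirror for automatic redirection):

    Data Source=PrimarySql;Failover Partner=MirrorSql;Initial Catalog=MyDb;Integrated Security=True;Connection Lifetime=120

The idea behind a non-zero Connection Lifetime is that pooled connections get discarded after the given number of seconds instead of being reused indefinitely, so stale connections pointing at the failed principal are retired rather than left to hang.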