Wall of text (my apologies, but you'll need to read it all):
Error Message: Database Error: Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time.
Environment:
Virtual VMware server, Windows Server 2008 R2 SP1, running SQL Server 2008 SP3
32GB RAM - about 50 Databases
10Gb LAN connection, datastore storage provided by SSD SAN.
Application is CSTS connecting to SQL Server "DIRGE".
The application is configured to connect to another application for document retrieval, "Onbase", whose database is also stored on DIRGE.
Throughout the day, CSTS will get connection time-outs. It's usually in spurts, so if one user is getting a timeout, usually someone else is getting one as well.
SQL has 28GB of the 32GB allocated. Memory utilization is consistently above 95%.
We cannot add more RAM as 2008 R2 Standard doesn't see more than 32GB.
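If anyone wants to check how the cap is actually set, a minimal sketch (assuming the SqlServer PowerShell module is available; the commented value is just an example, not a recommendation):

    # Show (and optionally lower) 'max server memory' so the OS keeps some headroom.
    $q = "EXEC sp_configure 'show advanced options', 1; RECONFIGURE; " +
         "EXEC sp_configure 'max server memory (MB)';"
    Invoke-Sqlcmd -ServerInstance 'DIRGE' -Query $q
    # To change it: EXEC sp_configure 'max server memory (MB)', 26624; RECONFIGURE;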
CPU utilization was very high at times and trending upward over time, so we added a second CPU (now 2 sockets, 4 cores per socket).
I've scoured the event logs, the SQL logs, and the CSTS error logs looking for a commonality. I'm finding very little. I've resolved all the event log errors; no joy.
NOTE: The Onbase server also gets connection timeouts to SQL, so I don't believe it's application-specific.
Scheduled Events:
Logs are backed up at 8am, 11am, 2pm, 5pm.
There's an SSIS package that runs every 15 minutes and takes about 8 minutes to run. However, I did not find any correlation to the timeouts.
There are maintenance plans that run after hours as well.
IPv4 and IPv6 are both enabled.
Clients are referencing the database server by IP, so it's not a name resolution issue.
The TCP/IP protocol is enabled, static port set to 1433.
I ran a portqry from the Onbase server to TCP 1433 and UDP 1434 and it IS listening.
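For anyone repeating the check, it was essentially this (assuming portqry.exe is on the PATH):

    & portqry.exe -n DIRGE -e 1433 -p tcp     # the engine's static TCP port
    & portqry.exe -n DIRGE -e 1434 -p udp     # SQL Browser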
We have a Solarwinds Database Analyzer running and watching this server; it says CPU and RAM are issues. I can get more details from it if anyone is interested.
I've google-fu'd the heck out of this and I just can't seem to find a good answer. From my searching it seems like a networking issue, but we've watched the network and I'm not seeing anything that would be the cause. Overall throughput is very low.
I will say this: the Onbase server is on a different subnet than DIRGE, but I've run a test DB connection using the name, named pipe and IP, and they all work without issue.
The problem is I'm not a DBA, so I'm learning this on the fly (I'm a Sr Systems Engineer).
I'm curious if someone has a suggestion on how to hunt this down.
Related
I am having issues reliably connecting to Azure SQL Databases from on-prem. (so yes, firewalls involved). When I connect (SSMS or invoke-sqlcmd) using just mylogin, sometimes it works but frequently I'll get a message like:
Connection timeout expired. The timeout period elapsed while attempting
to consume the pre-login handshake acknowledgement. This could be
because the pre-login handshake failed or the server was unable to
respond back in time. The duration spent while attempting to connect
to this server was 0 [Pre-Login] initialization=67; handshake=14948;
However, I've noticed that if I connect as mylogin@the-server-name (not the FQDN, just the part of the servername we can choose) then it seems to be a lot more reliable.
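To make the comparison concrete, the two forms look roughly like this (a sketch; names and credentials are placeholders):

    $pw = Read-Host 'Password'
    # Plain login name: frequently times out in the pre-login handshake.
    Invoke-Sqlcmd -ServerInstance 'myserver.database.windows.net' -Database 'mydb' -Username 'mylogin' -Password $pw -Query 'SELECT 1'
    # Login qualified with the server name: noticeably more reliable.
    Invoke-Sqlcmd -ServerInstance 'myserver.database.windows.net' -Database 'mydb' -Username 'mylogin@myserver' -Password $pw -Query 'SELECT 1'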
Is this something I can do something about? Firewall weirdness? Any help appreciated.
Looking in the ring buffer (SELECT CAST(record as xml) AS record FROM sys.dm_os_ring_buffers WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY') I have records, but I'm not seeing anything about failed connections, just killed connections (and TotalLoginTimeInMilliseconds is null).
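For reference, a shredded version of that query makes the record types and login timers easier to scan (a sketch; the XML paths follow the usual RING_BUFFER_CONNECTIVITY layout, and the server name is a placeholder):

    $q = "SELECT x.value('(//ConnectivityTraceRecord/RecordType)[1]', 'varchar(40)') AS RecordType, " +
         "x.value('(//ConnectivityTraceRecord/RecordTime)[1]', 'varchar(40)') AS RecordTime, " +
         "x.value('(//LoginTimers/TotalLoginTimeInMilliseconds)[1]', 'bigint') AS TotalLoginMs " +
         "FROM (SELECT CAST(record AS xml) AS x FROM sys.dm_os_ring_buffers " +
         "WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY') AS rb;"
    Invoke-Sqlcmd -ServerInstance 'myserver.database.windows.net' -Query $q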
Problem
When executing a data-heavy SSIS package that inserts data from a database in EnvironmentA1 into a database in EnvironmentB1, I get the following error:
A fatal error occurred while reading the input stream from the network. The session will be terminated (input error: 10060, output error: 0)
Context Information
EnvironmentA1 - virtual machine in local data center, running SQL Server 2017
EnvironmentB1 - virtual machine in Azure, running SQL Server 2017
The package is executed from the SSIS Catalog, scheduled daily by SQL Agent. Very occasionally it succeeds, but it is now generally expected to fail every time it runs, at a different step each time.
What is really baffling to me is that if I run the same package interactively in Visual Studio, using the exact same connection strings with the same security context for both the EnvironmentA1 and EnvironmentB1 connection managers, it succeeds every time without any issues. Visual Studio itself is installed elsewhere, in EnvironmentC1.
This is how example entries in the SQL Error Log on EnvironmentB1 look around the time of failure:
Error messages from SSIS Catalog execution report:
Everything above, and the research I've done, suggests that this is a network-related issue. The common suggestion found was to disable any TCP Offloading related features, which I did for both environments, but that didn't make any difference.
Additionally, for testing purposes, I disabled the following features in the NIC configuration on each environment (a scripted equivalent is sketched after the lists):
EnvironmentA1:
Receive-Side Scaling State
Large Send Offload V2 IPv4
Large Send Offload V2 IPv6
TCP Checksum Offload IPv4
TCP Checksum Offload IPv6
EnvironmentB1:
Receive-Side Scaling
Large Send Offload Version 2 IPv4
Large Send Offload Version 2 IPv6
TCP Checksum Offload IPv4
TCP Checksum Offload IPv6
IPSec Offload
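For repeatability, the same changes can be scripted (a sketch, assuming an adapter named 'Ethernet'; display names vary by driver, so list them first):

    Get-NetAdapterAdvancedProperty -Name 'Ethernet'        # see what this driver exposes
    Disable-NetAdapterRss             -Name 'Ethernet'     # Receive-Side Scaling
    Disable-NetAdapterLso             -Name 'Ethernet'     # Large Send Offload v2, IPv4 and IPv6
    Disable-NetAdapterChecksumOffload -Name 'Ethernet'     # TCP/IP checksum offloads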
Also to note: there are other SSIS packages that interact with the same two environments, and some of them have never produced a similar error, but they are either dealing with an insignificant amount of data or pushing it in the opposite direction (EnvironmentB1 to EnvironmentA1).
As a temporary measure I have also tried deploying the package to the SSIS Catalog of EnvironmentA2 (the development version of EnvironmentA1) and scheduling execution using the production connection strings, but it hits the exact same issue, and the only guaranteed way to run the package successfully remains running it via Visual Studio.
If anyone could at least point me in the right direction of diagnosing this issue, that would be greatly appreciated. Please let me know if I can add any other info for the context.
Your 3rd SSIS error states the connection was forcibly closed by the remote host.
That suggests firewall or network filtering issues. Check with your network guys if that could be the case.
I would really appreciate another perspective on an issue we have been experiencing.
The environment:
We have a small subset of VMs (5 Windows Server 2008 R2 VMs) hosted on a Windows Server 2012 cluster of 8 physical hosts, which supports hundreds of VMs across various OS versions (2008/2012, etc.).
The issue:
Servers within the subset of VMs experience widespread network SERVICE failures. The failure presents itself as a loss of connectivity for a large number of network-related services operating on the VMs (including certain critical network-dependent applications).
The impacts:
Server remains online.
Inability to RDP to the servers via Domain Accounts (Local accounts are fine).
Windows event logs associated with Netlogon Failure: Event ID 5719 - This computer was not able to set up a secure session with a domain controller in domain DOWNERGROUP due to the following:
The RPC server is unavailable. This may lead to authentication problems.
Windows event logs associated with Group Policy failure:
Event ID 1054:The processing of Group Policy failed. Windows could not
obtain the name of a domain controller. This could be caused by a name
resolution failure. Verify your Domain Name System (DNS) is configured
and working correctly
Widespread agent failure (AV, monitoring, application) - lack of connectivity to centralised management servers.
The resolution(s): stopping an agent service. Strangely, it's not limited to a specific agent: if we stop agent A, the server comes back to life; but equally, if we instead stop agent B, the server comes back to life with agent A still running. Restarting the VM also resolves the issue.
Note that these events do not appear on other VMs hosted on the same host at the time of the outage. Also note that the guest is located on the same host prior to, during, and after the outage.
We have investigated the suspicion that there may be issues with dynamic port range allocation, with the server possibly getting into a bottleneck state. We have implemented the "MaxUserPort" and "TcpTimedWaitDelay" registry parameters and set them to 65k and 30 respectively.
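For reference, the change amounts to this (standard Tcpip\Parameters location; 65534 is the documented ceiling for MaxUserPort, and a reboot is required):

    $p = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
    New-ItemProperty -Path $p -Name MaxUserPort       -PropertyType DWord -Value 65534 -Force   # widen the ephemeral port range
    New-ItemProperty -Path $p -Name TcpTimedWaitDelay -PropertyType DWord -Value 30    -Force   # shorten TIME_WAIT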
Also note that when an outage occurs, it does not always occur on the same VMs in the group. Often it is 2, 3, 4 or all of the servers.
I'm really just asking if anyone can see these symptoms and relate them to possible causes for our situation.
Any help/discussion would be appreciated.
Well, this turned out to be an interesting resolution.
We discovered that one of our server agents, while not actually showing open ports in netstat, had over 40,000 handles, growing linearly over time.
We had to enable the "Handles" column in Task Manager to be able to see this info.
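The same check is quicker from PowerShell, for example:

    # Top ten processes by handle count; a count growing without bound suggests a leak.
    Get-Process | Sort-Object Handles -Descending | Select-Object -First 10 ProcessName, Id, Handles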
This was the miracle post...
http://blogs.technet.com/b/kimberj/archive/2012/07/06/sever-quot-hangs-quot-and-ephemeral-port-exhaustion-issues.aspx
On Friday our MS SQL Server 2012 suddenly showed 100% disk I/O utilization in New Relic (performance monitoring). We had made no updates whatsoever, and Windows Update showed nothing that had happened.
The load on the server was low, because Fridays have low traffic on our website. The disk I/O utilization has stayed high even over the weekend.
The server is a VMware machine with 16 processors and 36 GB of memory. The disks are located on a SAN.
We have about 5 MB of reads per second and very few writes on the database server.
The server has about 500 I/O operations per second.
The CPU is at 25%.
The database is stored in 12 files on a separate drive on the server.
No long-running tasks are running.
The server is defragmented and all the indexes on the database have been rebuilt.
Perfmon on the SQL server shows the disk queue peaking at 5.
Our server guys say that the SAN is running smoothly, but my guess is that something happened on that Friday which keeps our SQL Server waiting for file operations.
Any ideas?
There are lots of reasons this could happen.
Most obvious, backup schedule?
If your traffic was so low on Friday that IIS shut down your Application Pool after the idle timeout, the next hit to that Website will trigger reloading of all data which is cached on application startup.
Since your database server is virtualised, the IO to that server may be also virtualised (as opposed to directly connecting the SQL Server Virtual Machine to the storage on the physical host). In this case, your database server performance may be limited by other machines on the same host saturating the link to the SAN.
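One way to narrow this down from the SQL side is to look at where the I/O stalls actually land; a sketch (instance name is a placeholder):

    $q = "SELECT DB_NAME(vfs.database_id) AS db, mf.physical_name, " +
         "vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_stall_ms, " +
         "vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms " +
         "FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs " +
         "JOIN sys.master_files AS mf ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id " +
         "ORDER BY avg_read_stall_ms DESC;"
    Invoke-Sqlcmd -ServerInstance '.' -Query $q

If the stalls are spread evenly across every file, the bottleneck is likely below SQL Server (host or SAN path); if one database dominates, look at its workload.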
The reason was another SQL Server on the same host that was accessing the disk a lot.
We have one VM for BizTalk and a separate VM for the SQL backend. We are using Veeam for backups which basically kicks off a snapshot of the VM. When this snapshot is being finalized on the SQL VM, BizTalk services on the application server fail. Usually they restart automatically but sometimes this requires manual intervention to start the services. The error below is logged on the BizTalk server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
An error occurred that requires the BizTalk service to terminate. The most common causes are the following:
1) An unexpected out of memory error.
OR
2) An inability to connect or a loss of connectivity to one of the BizTalk databases.
The service will shutdown and auto-restart in 1 minute. If the problematic database remains unavailable, this cycle will repeat.
Error message: [DBNETLIB][ConnectionRead (recv()).]General network error. Check your network documentation.
Error source:
BizTalk host name: BizTalkServerApplication
Windows service name: BTSSvc$BizTalkServerApplication
We experienced the same situation and error with both BizTalk 2009 and BizTalk 2013, each set up with two App servers and one SQL DB server.
When our VMware does the final step of the snapshot backup on the application servers, it freezes the application server for about 10 seconds, preventing it from receiving packets. On SQL Server 2008 and 2012, the server will by default send out keep-alive packets to the clients every 30 seconds (30,000 ms). If the SQL server fails to receive a response back from the app server, it will send out 5 retries (the default setting) of the keep-alive request, 1 second (1,000 ms) apart. If SQL still does not receive a response, it terminates the connection, which causes the BizTalk hosts on the app server to reset; in our case, when our German-made ERP system sends its EDI documents over to BizTalk during that reset period, the transmission fails.
We trapped the issue by running NetMon on the DB and app servers and waiting for the next error message. Upon inspection, we saw the five SQL keep-alive packets being sent to the app servers 1 second apart, and at the same time there were NO packets at all received on the application server. At first guess one might think they were "just dropped network packets", which is rarely the case. We then made the correlation to the timing of the VM snapshots, and can now confirm that each time the snapshot finishes each day, the app servers freeze.
As a short-to-mid-term workaround, we raised the number of retries SQL attempts before declaring a connection dead (5 by default) by adding the registry value TcpMaxDataRetransmissions and setting it to 30 (thus 30 seconds before SQL declares the client unresponsive). This has masked the problem for us for now; use at your own discretion.
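For reference, the registry change amounts to this (applied on the SQL box; reboot required, same use-at-your-own-discretion caveat):

    $p = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
    New-ItemProperty -Path $p -Name TcpMaxDataRetransmissions -PropertyType DWord -Value 30 -Force   # default is 5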
We are also looking at an Agent-based version of the VM Snapshot, which may alleviate the condition of freezing the server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
Not that I am aware of, however you might want to Google config options in the btsntsvc.exe.config file which is located in your BizTalk installation directory.
All messages that pass through BizTalk are written to the BizTalkMsgBoxDb and its other databases are involved if you are running tracking, BAM etc. The only service that can cache 'stuff' and handle a database outage is the Enterprise Single Sign-On (ESSO) Service. BizTalk therefore needs a persistent connection to the database server to remain 'up', hence why your Host Instance (BizTalkServerApplication) is stopping - it simply wouldn't be able to process messages if the database wasn't there.
I would add that your approach to backups probably isn't supported by Microsoft, and I would further suggest that you seriously consider whether an approach that takes your database server offline during the backup is really viable.
BizTalk has a pretty robust backup solution for its various databases built into the product, and I would recommend that you take a look at using this supported method.
If you do need to take snapshots of the database system - say once a night - you might want to consider stopping the BizTalk Host Instances, performing the snapshot, and then re-starting the Host Instances through some scripted task.
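A rough sketch of such a scripted task, using the BizTalk WMI provider (in-process hosts only; test carefully before relying on it):

    $his = Get-WmiObject MSBTS_HostInstance -Namespace 'root\MicrosoftBizTalkServer' |
           Where-Object { $_.HostType -eq 1 }     # 1 = in-process host instances
    $his | ForEach-Object { $_.Stop() }
    # ... take the VM snapshot here ...
    $his | ForEach-Object { $_.Start() }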
You might also want to consider checking whether there are any hotfixes for your version of BizTalk Server included in a Cumulative Update that might help address your problem.