I would need to run EJ technologies license server for serving floating license for JProfiler. Furthermore, I need to monitor the service via monitoring solutions (i.e. prometheus), properly.
So question is, from an operational perspective, how to determine, if the license server is "healthy", means not only the process is running and the TCP socket is there, but the service does the job?
There is no health check port for the license server. What you can do is to run
bin/admin -list
and check if the process exit code is 0. If the exit code is 1, the license server is not running or unresponsive.
Related
Problem
When executing a data-heavy SSIS package that inserts the data from a database in EnvironmentA1 to a database EnvironmentB1, I get the following error:
A fatal error occurred while reading the input stream from the network. The session will be terminated (input error: 10060, output error: 0)
Context Information
EnvironmentA1 - virtual machine in local data center, running SQL Server 2017
EnvironmentB1 - virtual machine in Azure, running SQL Server 2017
The package is being executed from SSIS Catalog scheduled daily by SQL Agent. Very occasionally it will succeed but it is now generally expected to fail every time it runs, different step every time.
What is really baffling to me about this is that if I set to run the same package interactively in Visual Studio using the exact same connection strings with the same security context for both EnvironmentA1 & EnvironmentB1 connection managers it will succeed every time without any issues. The Visual Studio itself is installed elsewhere in EnvironmentC1.
This is how example entries in SQL Error Log on EnvironmentB1 look like around the time of failure:
Error messages from SSIS Catalog execution report:
Everything above and the research made suggest that this is network related issue. The common suggestion found was to disable any TCP Offloading related features which I did for both environments but that didn't make any difference.
EnvironmentA1:
EnvironmentB1:
Additionally for testing purposes I disabled the following features from NIC configuration on each environmet:
EnvironmentA1:
Receive-Side Scaling State
Large Send Offload V2 IPv4
Large Send Offload V2 IPv6
TCP Checksum Offload IPv4
TCP Checksum Offload IPv6
EnvironmentB1:
Receive-Side Scaling
Large Send Offload Version 2 IPv4
Large Send Offload Version 2 IPv6
TCP Checksum Offload IPv4
TCP Checksum Offload IPv6
IPSec Offload
Also to note there are other SSIS packages that interact with same both environments and some of them has never produced a similar error, but they are either dealing with insignificant amount of data or pushing it in the opposite direction (EnvironmentB1 to EnvironmentA1)
As a temporary measure I have also tried deploying the package to the SSIS Catalog of EnvironmentA2 (development version of EnvironmentA1) and scheduling execution using production connection strings, but it gets the exact same issue and the only guaranteed way to run the package successfully remains running it via Visual Studio.
If anyone could at least point me in the right direction of diagnosing this issue, that would be greatly appreciated. Please let me know if I can add any other info for the context.
Your 3rd SSIS error states the connection was forceably closed by remote host.
That suggests firewall or network filtering issues. Check with your network guys if that could be the case.
There is a System process (PID == 4) in windows. It uses more internet, how can I block this process from using the internet,
once I blocked with firewall inbound and outbound connections, it has worked for 5 days, after 5th day it uses the internet again.
"System Process" represents the operating system (some system services) and the drivers.
Notably, antiviruses and malware tools use the "System Process". Blocking it indiscriminately is not a good practice as you can break core functionalities.
Inspect your configuration and the traffic in order to spot which "applcation" is consuming so much bandwidth.
I was able to find that osquery can work in interactive mode (osqueryi) and in daemon mode (osqueryd), in which it will periodically execute SQL queries in the background on a localhost.
How about remote execution of SQL queries - for example, REST service or JDBC-driver?
When osquery is running in daemon mode, you can enable the distributed query facilities. When this is enabled, osqueryd will periodically check in to a remote server to see whether there are queries for it to execute (typical intervals for this check range from 10 seconds to 1 minute).
Note that due to the nature of the environments that osquery runs in, the osquery agent does not listen for incoming connections. It only ever makes outgoing connections to a remote server to check for queries to execute.
To take advantage of this, you need a server implementing the osquery remote APIs. There are a handful of open-source options available:
Fleet (disclaimer: I build this)
Zentral
Doorman
SGT
Security note: providing remote execution on an osquery agent can be very dangerous since it can retrieve sensitive information on the device it runs on. If you plan to serve some sort of a web page allowing direct queries on your agent, be aware that since osquery provide an SQL abstraction of your system, it can be vulnerable to injections.
I would really appreciated another perspective on an issue we have been experiencing.
The environment:
We have a small subset of VMs (5 Windows Server 2008 R2 VM's) hosted on a Windows Server 2012 Cluster of 8 Physical Hosts which supports 100's over VMs across various OS (2008/2012 etc).
The issue:
Servers within the subset of VMs experience widespread network SERVICE failures. The failure presents itself as a loss in connectivity for a large number of network related services operating on the VMs (including certain critical network dependant applications).
The impacts:
Server remains online.
Inability to RDP to the servers via Domain Accounts (Local accounts are fine).
Windows event logs associated with Netlogon Failure: Event ID 5719 - This computer was not able to set up a secure session with a domain controller in domain DOWNERGROUP due to the following:
The RPC server is unavailable. This may lead to authentication problems.
Windows event logs assocaited with Group Policy Failure:
Event ID 1054:The processing of Group Policy failed. Windows could not
obtain the name of a domain controller. This could be caused by a name
resolution failure. Verify your Domain Name System (DNS) is configured
and working correctly
Widespread Agent Failure (AV, Monitoring, Application) - Lack of connectivty to centralised management servers.
The resolution(s). Stopping an agent service. Strange however its not limited to a specific agent however if we stop agent A, the server comes back to life, however if we also stop agent B, the server comes back to life with Agent A still running. Restarting the VM also resolves the issue.
Note that these events do not appear on other VMs hosted off the same host at the time of the outage. Also note that the guest is located on the same host prior to, during and after the outage.
We have investigated the suspicion that their may be issues with Dynamic Range Port Allocation with the server possibly getting into a bottleneck state. We have implementedthe "MaxUserPort" and "TCPTimedWaitDelay" registry parameters and have set them to 65k and 30 respectively.
Also note that when an outage occurs, it does not always occur on the same VMs in the group. Often times it is 2, 3, 4 or all servers.
Im really just asking if anyone can see these symptoms and relate to possible causes for our situation.
Any help/discussion would be appreciated.
Well, this turned out to be an interesting resolution.
We discovered that one of our server agents, while not actually showing open ports in Netstat, had over 40,000 handles growing linearly over time.
Had to enable the "handles" column in task manager to be able to see this info.
This was the miracle post...
http://blogs.technet.com/b/kimberj/archive/2012/07/06/sever-quot-hangs-quot-and-ephemeral-port-exhaustion-issues.aspx
We have one VM for BizTalk and a separate VM for the SQL backend. We are using Veeam for backups which basically kicks off a snapshot of the VM. When this snapshot is being finalized on the SQL VM, BizTalk services on the application server fail. Usually they restart automatically but sometimes this requires manual intervention to start the services. The error below is logged on the BizTalk server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
An error occurred that requires the BizTalk service to terminate. The most common causes are the following:
1) An unexpected out of memory error.
OR
2) An inability to connect or a loss of connectivity to one of the BizTalk databases.
The service will shutdown and auto-restart in 1 minute. If the problematic database remains unavailable, this cycle will repeat.
Error message: [DBNETLIB][ConnectionRead (recv()).]General network error. Check your network documentation.
Error source:
BizTalk host name: BizTalkServerApplication
Windows service name: BTSSvc$BizTalkServerApplication
We experienced the same situation and error with both BizTalk 2009 and BizTalk 2013, each set up with two App servers and one SQL DB server.
When our VMware does the final step of the Snapshot backup on the Application servers, it freezes the application server for about 10 seconds, preventing it from receiving packets. On SQL Server 2008 and 2012, it by default will send out keep-alive packets to the clients every 30 seconds (30,000 ms). If the SQL server fails to receive a response back from the App server, it will send out 5 retries (default setting) of the keep-alive request 1 second (1,000 ms) apart. If SQL still does not receive the response back, it will terminate the connection, which will cause the BizTalk hosts on the App server to reset, and in our case, when our German-made ERP system sends its EDI documents over to BizTalk during that reset period, the transmission will fail.
We trapped the issue by running NetMon on the DB and App servers, waiting for the next error message. Upon inspection, we see the five SQL keep-alive packets being sent to the App servers 1 second apart, and at the same time there were NO packets at all received on the Application server. At first guess, one might think they were "just dropped network packets", which is rarely the case. We then made the correlation to the timing of the VM Snapshots, and now confirm each time the snapshot finishes each day, the App servers freeze.
As a Short-to-mid-term workaround, we raised the number of retries SQL attempts before declaring a connection dead, (5 by default), by adding the registry value TcpMaxDataRetransmissions and setting it to 30 (thus 30 seconds before SQL declares the client unresponsive). This has masked the problem for now for us, and use at your own discretion.
We are also looking at an Agent-based version of the VM Snapshot, which may alleviate the condition of freezing the server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
Not that I am aware of, however you might want to Google config options in the btsntsvc.exe.config file which is located in your BizTalk installation directory.
All messages that pass through BizTalk are written to the BizTalkMsgBoxDb and its other databases are involved if you are running tracking, BAM etc. The only service that can cache 'stuff' and handle a database outage is the Enterprise Single Sign-On (ESSO) Service. BizTalk therefore needs a persistent connection to the database server to remain 'up', hence why your Host Instance (BizTalkServerApplication) is stopping - it simply wouldn't be able to process messages if the database wasn't there.
I would add that your approach to back-ups probably isn't supported by Microsoft and I would further suggest that you seriously consider whether an approach that takes your database server offline during the backup is viable?
BizTalk has a pretty robust backup solution for its various databases built into the product, and I would recommend that you take a look at using this supported method.
If you do need to take snapshots of the database system - say once a night - you might want to consider stopping the BizTalk Host Instances, performing the snapshot, and then re-starting the Host Instances through some scripted task.
You might also want to consider checking whether there are any hotfixes for your version of BizTalk Server included in a Cumulative Update that might help address your problem.