How to use rotate_logs on a log file that is 80+gb's for RabbitMQ on windows server - rabbitmq

I need to run rabbitmqctl rotate_logs on a rabbitmq log file that is over 80gb's in size. When I tried to run this the first time it froze rabbit and no messages could be received. The freeze lasted 20 mins before I had to kill the command and restart the rabbit server.
This is a production server and completing this in a small amount of time without losing messages or killing the broker would be optimal.
Would it be possible to shut down the service and move the current log file to another location and restart the service and then run the rotate_logs command?
I'm fairly new to rabbitmq and I am not sure what the best way to handle this would be.
This is installed on a windows 2008 server as a service for a heavy traffic production site (However the message queue has a small load and only affects the administrative side of things).
Any help or insight would be appreciated.

I ran into a similar situation, but with only about 4GB of log file instead of 80.
the workaround I used was pretty much what you suggested... stop the service, move the log file and restart the service as quickly as possible.
for me, specifically, instead of moving the file while the service was stopped i just renamed it. i also wrote a commandline script to do the work for me.
this allowed me to stop the service, rename the file and restart the service in a matter of seconds.
once the service was back up and running, i was free to move / rename / whatever the large log file as needed.

Related

Hangfire jobs stuck in processing state

We are using Hangfire to download data from Azure. We are using Hangfire 1.7.6. However, after running for some time, Hangfire is having a deadlock and seems stuck in processing the job. We had to restart the service to keep it working.
There is a recurring job which is adding jobs to the other background server. Mostly the jobs are stuck when it is downloading a big file.
Has anyone faced this type of problem of hangfire jobs stuck in processing?
Please let me know if any further information is required. Any help/guidance is appreciated.
Is this not caused by the length of time it takes to complete the download from azure?
You could try testing this with large files, and see how it handles it.
Also, like #jbl asked, how is your Hangfire Server hosted? If it is hosted in IIS then remember that the Hangfire Server may lose its heartbeat if IIS shuts down the application process due to it being idle for a given period of time.
I came across this issue in the past and ended up running the application as a process on the server.
IIS is optimised to save resources, so it will shutdown processes that aren't being used. When a request is made to your application, then it fires the process back up. This will also cause any scheduled background jobs not to fire.

Redis AOF rewrite issue, no new client connections

My Redis instance has apparently stopped rewriting AOF file (it has grown to many Gbs). What is worse, it seems to stop serving new client connections (when connecting with redis-cli, connection goes through, but then it freezes on any command). That means that I cannot ask for BGREWRITEAOF. At the same time, existing connections are being served normally.
Log file doesn't show anything useful. redis-check-aof just reports that AOF file is not corrupted. I really don't want to restart the server as I don't know how long it's going to start with such a huge AOF file.
Is there a way to call AOF rewrite externally? Anything else I can do?
So I've just hit this Redis bug: https://github.com/antirez/redis/issues/2883 :(

asadmin start-domain fails when remote JMS queue is unreachable

I have 2 servers A and B running a glassfish 3.1.2.2 application server on them. Both use a JMS queue for communication, which works fine so far. If the network connection breaks for any reason, I can see in the logs of server B (the one configured to connect to the remote queue of A) that it tries to reconnect and is actually always successful in doing so as soon as A is up again.
But the problem is, that if I try to restart the glassfish instance on B while server A is unreachable, the startup process will fail after some retries and remains stuck in a kind of undefined/unusable state, i.e. the java process is started, some ports are open but the applications are not started - not even the administration console.
IMHO glassfish startup process should not wait for the queues to connect, this should be done in some kind of background process.
Has anyone of you experienced something similar? Is there anything I can configure/tune to fix this behaviour?
Never mind, it seems to have fixed itself :(
After restarting the computer,removing the deployed ear and deploying it again it just worked. I haven't experienced this behaviour since then.

BizTalk connectivity issue to SQL during VM snapshot

We have one VM for BizTalk and a separate VM for the SQL backend. We are using Veeam for backups which basically kicks off a snapshot of the VM. When this snapshot is being finalized on the SQL VM, BizTalk services on the application server fail. Usually they restart automatically but sometimes this requires manual intervention to start the services. The error below is logged on the BizTalk server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
An error occurred that requires the BizTalk service to terminate. The most common causes are the following:
1) An unexpected out of memory error.
OR
2) An inability to connect or a loss of connectivity to one of the BizTalk databases.
The service will shutdown and auto-restart in 1 minute. If the problematic database remains unavailable, this cycle will repeat.
Error message: [DBNETLIB][ConnectionRead (recv()).]General network error. Check your network documentation.
Error source:
BizTalk host name: BizTalkServerApplication
Windows service name: BTSSvc$BizTalkServerApplication
We experienced the same situation and error with both BizTalk 2009 and BizTalk 2013, each set up with two App servers and one SQL DB server.
When our VMware does the final step of the Snapshot backup on the Application servers, it freezes the application server for about 10 seconds, preventing it from receiving packets. On SQL Server 2008 and 2012, it by default will send out keep-alive packets to the clients every 30 seconds (30,000 ms). If the SQL server fails to receive a response back from the App server, it will send out 5 retries (default setting) of the keep-alive request 1 second (1,000 ms) apart. If SQL still does not receive the response back, it will terminate the connection, which will cause the BizTalk hosts on the App server to reset, and in our case, when our German-made ERP system sends its EDI documents over to BizTalk during that reset period, the transmission will fail.
We trapped the issue by running NetMon on the DB and App servers, waiting for the next error message. Upon inspection, we see the five SQL keep-alive packets being sent to the App servers 1 second apart, and at the same time there were NO packets at all received on the Application server. At first guess, one might think they were "just dropped network packets", which is rarely the case. We then made the correlation to the timing of the VM Snapshots, and now confirm each time the snapshot finishes each day, the App servers freeze.
As a Short-to-mid-term workaround, we raised the number of retries SQL attempts before declaring a connection dead, (5 by default), by adding the registry value TcpMaxDataRetransmissions and setting it to 30 (thus 30 seconds before SQL declares the client unresponsive). This has masked the problem for now for us, and use at your own discretion.
We are also looking at an Agent-based version of the VM Snapshot, which may alleviate the condition of freezing the server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
Not that I am aware of, however you might want to Google config options in the btsntsvc.exe.config file which is located in your BizTalk installation directory.
All messages that pass through BizTalk are written to the BizTalkMsgBoxDb and its other databases are involved if you are running tracking, BAM etc. The only service that can cache 'stuff' and handle a database outage is the Enterprise Single Sign-On (ESSO) Service. BizTalk therefore needs a persistent connection to the database server to remain 'up', hence why your Host Instance (BizTalkServerApplication) is stopping - it simply wouldn't be able to process messages if the database wasn't there.
I would add that your approach to back-ups probably isn't supported by Microsoft and I would further suggest that you seriously consider whether an approach that takes your database server offline during the backup is viable?
BizTalk has a pretty robust backup solution for its various databases built into the product, and I would recommend that you take a look at using this supported method.
If you do need to take snapshots of the database system - say once a night - you might want to consider stopping the BizTalk Host Instances, performing the snapshot, and then re-starting the Host Instances through some scripted task.
You might also want to consider checking whether there are any hotfixes for your version of BizTalk Server included in a Cumulative Update that might help address your problem.

mass-restarting httpd on lots of EC2 instances

I am running a variable number of EC2 instances (CentOS 64) that contain an apache web server that caches a bunch of code in production mode.
Now every time I make some changes to the code (generally on a weekly basis) I have to log into each one of them instances and do a "su" then "service httpd restart"
Is there a way to automate this so that I can run a single command on one of the instances it would connect to all others and restart it? Getting really time consuming especially when the application has spawned some 20-30 instances on its own (happens on some days when we get high traffic)
Thanks!
Dancer's shell, dsh, is provided specifically to do this. No 'scripting' required. As #tix3 suggests, you should probably also convince sudo on those machines (configure /etc/sudoers using visudo) to configure them to accept your restart command.