Delayed job workers stop after some time in Linode production - ruby-on-rails-3

I am using the delayed_job 3.0 gem for email notifications in my Rails 3 app. It is deployed on Linode using nginx + Capistrano. My Linode has 512 MB of RAM and 24 GB of storage, and two instances are running on it.
The delayed_job worker of the second instance is giving the problem: after some time it shuts down and needs a manual restart. There are no errors in Production.log or delayed_jobs.log. When I run "free -m", it shows:
                    total   used   free   shared   buffers   cached
Mem:                  496    415     80        0         5       45
-/+ buffers/cache:            364    131
Swap:                 255    130    125
I am not able to find the reason it goes down; please suggest a possible solution.

Related

AWS ECS Fargate Memory Utilization vs Local Docker

We are using AWS Fargate ECS tasks for our Spring WebFlux Java 11 microservice, built FROM the gcr.io/distroless/java:11 image. When our application is dockerised and run locally as a container, memory utilization is quite efficient and we can see that heap usage never crosses 50%.
However, when we deploy the same image using the same Dockerfile to AWS Fargate as an ECS task, the AWS dashboard shows a completely different picture. Memory utilization never comes down, and the CloudWatch logs show no OutOfMemory issues at all. Once deployed in AWS ECS, we ran a peak load test and a stress test, after which memory utilization reached 94%, and then ran a soak test for 6 hours; utilization was still 94% without any OOM errors. Garbage collection is happening constantly and not letting the application go OOM, but utilization stays at 94%.
To test the application's memory utilization locally we are using VisualVM. We are also trying to connect to the remote ECS task in AWS Fargate using Amazon ECS Exec, but that is a work in progress.
We have seen the same issue with other microservices in our cluster and in other clusters as well: once memory utilization reaches a maximum it never comes down. Please help if someone has faced the same issue before.
Edit on 10/10/2022:
We connected to the AWS Fargate ECS task using Amazon ECS Exec, and below are the findings.
We analysed the GC logs of the AWS ECS Fargate task and could see the messages. It uses the default GC for this configuration, the Serial GC. We keep getting "Pause Young (Allocation Failure)", which means an allocation could not be satisfied from the young generation, so a collection is triggered.
[2022-10-09T13:33:45.401+0000][1120.447s][info][gc] GC(1417) Pause Full (Allocation Failure) 793M->196M(1093M) 410.170ms
[2022-10-09T13:33:45.403+0000][1120.449s][info][gc] GC(1416) Pause Young (Allocation Failure) 1052M->196M(1067M) 460.286ms
We made some code changes to avoid a byte array being copied in memory twice; the memory did come down, but not by much.
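For illustration only (the class and method names below are hypothetical, not taken from the actual service), the kind of pattern we changed looks roughly like this: the explicit copy doubles the transient footprint of every request, and passing the original array avoids it.
import java.util.Arrays;

// Hypothetical example of a request payload being duplicated in memory.
class PayloadExample {
    static void handle(byte[] payload) {
        // Avoidable second copy: doubles the transient footprint of this request.
        byte[] duplicate = Arrays.copyOf(payload, payload.length);
        process(duplicate);
    }

    static void handleWithoutCopy(byte[] payload) {
        // Passing the original array avoids the extra allocation entirely.
        process(payload);
    }

    static void process(byte[] data) {
        // placeholder for the real processing logic
    }
}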
/app # ps -o pid,rss
PID RSS
1 1.4g
16 16m
30 27m
515 23m
524 688
1655 4
/app # ps -o pid,rss
PID RSS
1 1.4g
16 15m
30 27m
515 22m
524 688
1710 4
Even after a full GC like the one below, the memory does not come down:
[2022-10-09T13:39:13.460+0000][1448.505s][info][gc] GC(1961) Pause Full (Allocation Failure) 797M->243M(1097M) 502.836ms
One important observation was that after running a heap inspection, a full GC was triggered, and even that did not clear up the memory. The log shows 679M->149M, but the ps -o pid,rss output does not show the drop, and neither does the AWS Container Insights graph.
[2022-10-09T13:54:50.424+0000][2385.469s][info][gc] GC(1967) Pause Full (Heap Inspection Initiated GC) 679M->149M(1047M) 448.686ms
[2022-10-09T13:56:20.344+0000][2475.390s][info][gc] GC(1968) Pause Full (Heap Inspection Initiated GC) 181M->119M(999M) 448.699ms
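One thing worth separating here: the GC log reports heap used before and after a collection, while RSS also reflects heap memory that stays committed to the process, plus metaspace, code cache, thread stacks and other native allocations, and the committed heap is not necessarily returned to the OS after a full GC. A minimal sketch of that difference, assuming you can run a small probe inside the same JVM (for example behind a debug endpoint), would be:
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: compare heap "used" (what the GC log reports) with heap
// "committed" (what contributes to the RSS the OS sees).
public class HeapFootprint {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used      : %d MiB%n", heap.getUsed() / (1024 * 1024));
        System.out.printf("heap committed : %d MiB%n", heap.getCommitted() / (1024 * 1024));
        System.out.printf("heap max       : %d MiB%n", heap.getMax() / (1024 * 1024));
        // RSS ~ committed heap + metaspace + code cache + thread stacks + native
        // memory, so RSS can stay high even when "used" drops after a full GC.
    }
}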
How are you running it locally? Do you set any parameters (CPU/memory) for the container you launch? On Fargate there are multiple levels of resource configuration (the size of the task and the amount of resources you assign to the container; check out this blog for more details). The other thing to consider is that, with Fargate, you may land on an instance with far more capacity than the task size you configured. Fargate will create a cgroup that boxes your container(s) to that size, but some old programs (and Java versions) are not cgroup-aware and may assume that the amount of memory available is the memory on the instance (which you don't see) rather than the task size (and cgroup) that was configured.
I don't have an exact answer (and this did not fit into a comment) but this may be an area you can explore (being able to exec into the container should help - ECS exec is great for that).
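One quick way to check the cgroup point above is to print what the JVM itself thinks it has been given, for example with a small class (or the same calls in jshell) run inside the task via ECS Exec; this is just a sketch, not part of the service:
// Minimal sketch: print the resources the JVM believes it has. If maxMemory()
// reports something close to the underlying instance rather than the Fargate
// task size, the JVM is not picking up the cgroup limits.
public class JvmResources {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("available processors : %d%n", rt.availableProcessors());
        System.out.printf("max heap (MiB)       : %d%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("total heap (MiB)     : %d%n", rt.totalMemory() / (1024 * 1024));
        // Java 11 is container-aware by default (UseContainerSupport), but the
        // default max heap is only a fraction of the container limit unless
        // -Xmx or -XX:MaxRAMPercentage is set explicitly.
    }
}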

Large commit stalls halfway through

I have a problem with our Subversion server. Small commits work fine, but as soon as someone tries to commit a large collection of sizeable files, the commit stalls halfway through and the client finally times out. My test set consists of roughly 2000 files, and the total size of the commit is about 1 GB. When I commit the files, the upload starts, but about halfway through the transfer rate drops to 0 kb/s and the commit just stalls and never recovers. If I split the commit into smaller pieces (<150 MB) everything works just fine, but that breaks the atomicity of the commit and is something I really want to avoid.
When I look at the logs generated by Apache, there are no error messages.
When I bumped the log level from debug to trace6 on the Apache server, some errors appear at the moment the upload stalls:
...
OpenSSL: I/O error, 2229 bytes expected to read on BIO
OpenSSL: read 1460/2229 bytes from BIO
...
Versions used:
We are running the connection to the subversion via apache, mod_dav, mod_dav_svn, mod_authz_svn and mod_auth_digest. The client connects via https.
Server:
OpenSuse 42.3
svnserve: 1.9.7
Apache: 2.4.23
Client:
Windows 10 enterprise
svn client: 1.10.0-dev.
What I tried so far:
I have tried increasing the TimeOut value in the Apache configuration. The only difference is that the client stays in the stalled state longer before reporting the timeout.
I have tried increasing the MaxKeepAliveRequests from 100 to 1000. No change.
I have tried adding SVNAllowBulkUpdates Prefer to the svn settings. No change.
Has anyone got any hints on how to debug these types of errors?

RabbitMQ crash with bump_reduce_memory_use

I am using RabbitMQ 3.7.3 on Erlang 20.2.2, deployed in Docker (image rabbitmq:3.7-management).
Memory is set up like this: Memory high watermark set to 6000 MiB (6291456000 bytes) of 8192 MiB (8589934592 bytes) total.
Here is the crash report that I am getting on automatic restart of RabbitMQ:
CRASH REPORT Process <0.818.0> with 0 neighbours exited with reason:
no function clause matching
rabbit_priority_queue:handle_info(bump_reduce_memory_use,
{state,rabbit_variable_queue,[{10,{vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,0,undefined},...}},...],...})
line 396 in gen_server2:terminate/3 line 1161
It seems to be due to messages posted to a queue set up like this and filled with 500k+ messages:
Thanks for your help !
I filed this bug and opened these pull requests to fix this issue - 3.7.x PR, master PR. This fix will ship in RabbitMQ 3.7.4.
In the future, it would be preferable to discuss or report issues on the mailing list as the RabbitMQ core team monitors it daily.
Thanks for reporting this issue and for using RabbitMQ.

Apache ActiveMQ Server startup issue in windows 7

jvm 1 | WARN  | Store limit is 102400 mb, whilst the data directory: C:\apache-activemq-5.8.0\bin\win32\..\..\data\kahadb only has 44093 mb of usable space
jvm 1 | ERROR | Temporary Store limit is 51200 mb, whilst the temporary data directory: C:\apache-activemq-5.8.0\bin\win32\..\..\data\localhost\tmp_storage only has 44093 mb of usable space
It's telling you that your configured limits don't fit within the amount of disk space you have available at the store location. This can lead to broker failure in older versions, as the limits are not lowered automatically to match the disk space; in the latest release the broker will lower the limits itself. When you see this, it means you should either rethink your store location or your broker configuration.
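As an illustration only: for the standalone Windows broker the equivalent storeUsage/tempUsage limits live in the <systemUsage> section of conf/activemq.xml, but the same idea expressed with the embedded-broker Java API (values are examples, not recommendations) looks roughly like this:
import org.apache.activemq.broker.BrokerService;

// Rough sketch using the embedded-broker API; the standalone broker reads the
// same limits from <systemUsage> in conf/activemq.xml.
public class LimitedBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        // Keep both limits below the ~44 GB reported as usable on the drive.
        broker.getSystemUsage().getStoreUsage().setLimit(30L * 1024 * 1024 * 1024); // 30 GB persistent store
        broker.getSystemUsage().getTempUsage().setLimit(10L * 1024 * 1024 * 1024);  // 10 GB temp store
        broker.start();
    }
}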

Azure Cloud service returns error 500 after 30 hours

I have a strange problem. I have an MVC 4 application (cloud service) on MS Azure. The application works fine after deployment, but after 24-30 hours it starts returning error 500. Then I have to reboot the instance. Currently it is running on machine size S, I have 900 megabytes of free memory, and the CPU is at about 3%. I have 1 instance. OS family = 3 (because of .NET Framework 4.5)... Any ideas what is going on?
I have found it. Thanks for the idea about the app pool. The application pool automatically recycles after 29 hours (default setting). So I recycled it manually and got this error:
Could not load file or assembly 'file:///D:\Program Files (x86)\Reference Assemblies\Microsoft\Framework.NETFramework\v4.5.1\System.Data.Entity.dll'
From the log I found that the reason is Entity Framework Profiler; I forgot to remove it before deploying the app.