I am running an application using docker-compose.
One of the containers is a selenium/standalone-chrome image. I give it an shm_size of 2g.
The application works fine when there is no high load. However, I have noticed that whenever there are concurrent requests to the selenium container (9 concurrent requests on an 8-core machine) Selenium fails silently. It just dies and stays dead. Subsequent requests are not handled. There is nothing in the logs. The last message is:
17:41:00.083 INFO [RemoteSession$Factory.lambda$performHandshake$0] - Started new session 5da2cd57f4e8e4f80b907564d7352051 (org.openqa.selenium.chrome.ChromeDriverService)
I am monitoring the RAM and CPU usage using both docker stats and top. RAM is fine, about 50% used. Using free -m shows shared memory at about 500m. The 8 cores are taking the load, staying at around 80% most of the time. However, whenever the last request arrives, the processes just die out. CPU usage drops. Shared memory does not seem to be released, though.
In order to make it work again, I have to restart the application. Otherwise, none of the subsequent requests are received or logged.
I suspect there might be some kind of limitation from the OS on the containers, and once they start consuming resources the OS kills them, but to be fair, I have no idea what is going on.
Any help would be greatly appreciated.
Thanks!
Update:
Here is the relevant part of my docker-compose file for reference:
    selenium-chrome:
      image: selenium/standalone-chrome
      privileged: true
      shm_size: 2g
      expose:
        - "4444"
This is what my logs look like when it hangs:
And after I kill the docker-compose process and restart it:
I have also tested different images. These screenshots were actually taken with the image selenium/standalone-chrome:3.141.59-gold.
One last thing that puzzles me even more - I am using Selenium for screenshots, and I have added a webhook call in the Java code for when the process fails. I would expect it to fire if the Selenium process dies; however, it seems the Java side does not consider the Selenium connection dead and keeps waiting until I run docker-compose down. Then all the messages from the webhook are fired.
Update 2:
Here is what I have tried and I know so far:
1. Chrome driver version makes no difference
2. shm_size increase does not make any difference
3. JVM memory limit makes no difference - command: ["java", "-Xmx2048m", "-jar", "/opt/selenium/selenium-server-standalone.jar"]
4. always hangs at the same spot: 8 concurrent processes on an 8-core machine
5. once dead, stays dead
6. lots of Chrome processes are left hanging - ps aux | grep chrome
6.1 if those processes are killed - sudo kill -9 $(ps aux | grep 'chrome' | awk '{print $2}') - Selenium does not start handling requests again and stays dead (see the OOM check sketched below this list)
7. --no-sandbox option does not help
8. the Java process is alive on the host - telnet ip 4444 -> connects successfully
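To confirm or rule out a kernel OOM kill (the usual suspect when processes die silently, as in points 5 and 6), the kernel log and the container state can be checked directly; a minimal sketch, assuming the compose service is named selenium-chrome:

    # look for OOM-killer activity around the time the sessions died
    dmesg -T | grep -iE 'out of memory|oom-kill|killed process'

    # ask Docker whether the container itself was OOM-killed
    docker inspect --format '{{.State.OOMKilled}} {{.State.Status}}' \
        $(docker-compose ps -q selenium-chrome)

If dmesg shows Chrome renderer processes being killed while OOMKilled stays false, the kernel is reaping processes inside the container rather than killing the container itself, which would match the hung Chrome processes from point 6.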
I suspect your selenium/standalone-chrome image is implemented using Java technology,
and the container's JVM has a bounded maximum memory, set with a JVM argument such as -Xmx2048m or a similar value.
Research the Selenium JVM setup/configuration files.
What can happen is one or more of the following:
The container application crashed with out-of-memory because the container's memory bound was reached. Solution: decrease the JVM max memory bound so it fits within the container's bound (maybe 2048m of heap plus JVM overhead ends up above 2g).
The JVM application crashed with an out-of-memory error. Solution: increase the JVM max memory bound towards the container's max memory bound (maybe 2048m is not sufficient for the task).
The container momentarily peaked at its CPU utilization limit and crashed. I assume Selenium implements massive parallelism (check its configuration). Solution: provide more compute power to the container, or decrease Selenium's parallelism.
Note that periodic resource monitoring tools fail to identify peak resource stress if the peak is momentary and sharp. Only if the resource stress builds up gradually can you identify the breaking point with them.
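To catch such a momentary spike you need sampling at a much finer granularity than dashboards or occasional top runs provide. A rough sketch of one-second sampling into a local log file (the log path is just a placeholder):

    # sample per-container CPU and memory once per second for later inspection
    while true; do
        { date '+%H:%M:%S'; docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemUsage}}'; } >> /tmp/docker-stats.log
        sleep 1
    done

Reviewing the log around the timestamp of the last "Started new session" line should show whether memory or CPU spiked just before the node went silent.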
Related
I'm trying to use an EC2 instance to run several Selenium chromedrivers in parallel. When I run just one, it works fine. But as soon as I start a second Selenium process, both processes fail (as in, page loads hit max timeouts after a couple minutes).
I'm using a t3.large instance, which has 8gb RAM and 5Gbps of network bandwidth. It's not a cheap instance and costs $2 per day. I'm surprised that with these specs, it can't handle two concurrent Selenium processes because my personal laptop has no problem handling 4+ Selenium processes.
Additional info: I'm using pyvirtualdisplay on the EC2 box.
Wondering if I'm missing something here that is causing the poor performance.
I am running Apache Guacamole on a Google Cloud Compute Engine f1-micro with CentOS 7 because it is free.
Guacamole runs fine for some time (an hour or so) and then unexpectedly crashes. I get the ERR_CONNECTION_REFUSED error in Chrome, and when running htop I can see that all of the Tomcat processes have stopped. To get it running again I just have to restart Tomcat.
I have a message saying "Instance "guac" is overutilized. Consider switching to the machine type: g1-small (1 vCPU, 1.7 GB memory)" in the compute engine console.
I have tried limiting the memory allocation to tomcat, but that didn't seem to work.
Any suggestions?
I think the ERR_CONNECTION_REFUSED is likely due to the VM instance falling short on resources: in order to keep the OS up, the process manager shuts down some processes. SSH is one of those processes, and once you reboot the VM, the services resume operation in full.
As per the "over-utilization" notification recommending g1-small (1 vCPU, 1.7 GB memory), please note that f1-micro is a shared-core micro machine type with 0.2 vCPU and 0.60 GB of memory, backed by a shared physical core, and is only suited to running smaller, non-resource-intensive applications.
Depending on your Tomcat configuration, also note that:
Connecting to a database is an intensive process.
When creating a Tomcat deployment from the Google Marketplace, the default VM setting is 1 vCPU + 3.75 GB memory (n1-standard-1), so upgrading to the machine type g1-small (1 vCPU, 1.7 GB memory) should be suitable in your case.
Why was the g1-small machine type recommended? Compute Engine uses the same CPU utilization numbers reported on the Compute Engine dashboard to determine what recommendations to make. These numbers are based on the average utilization of your instances over 60-second intervals, so they do not capture short CPU usage spikes.
So applications with short usage spikes might need to run on a larger machine type than the one recommended by Google, to accommodate these spikes.
In summary, my suggestion would be to upgrade as recommended. Also note that rightsizing gives warnings when a VM is underutilized or overutilized; in this case it is recommending that you increase your VM size due to overutilization, and keep in mind that this is only a recommendation based on the available data.
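For completeness, the machine type change itself can be done with the gcloud CLI; a sketch, assuming the instance is named guac as in the question (YOUR_ZONE is a placeholder for your actual zone):

    # the instance must be stopped before its machine type can be changed
    gcloud compute instances stop guac --zone=YOUR_ZONE
    gcloud compute instances set-machine-type guac --machine-type=g1-small --zone=YOUR_ZONE
    gcloud compute instances start guac --zone=YOUR_ZONE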
My Spark application is running on a remote machine in our internal lab. To analyse the memory consumption of the remote application, I attached the remote application's PID to JProfiler using 'attach mode' (with the help of jpenable) from my local machine.
After attaching the remote application to JProfiler on my local machine, JProfiler shows only 5% memory consumption for the remote machine, but when we run the 'top' command on the remote CentOS machine, 'top' shows 72% memory consumption, and I am unable to find that whole 72% with JProfiler.
Please help me get the full memory consumption statistics (i.e., the 72% of memory usage) using JProfiler.
top shows memory reserved by the JVM, not the actually used heap, so you cannot compare the two values.
In addition, the JVM uses native memory that does not show up in the heap. A Java profiler cannot analyze that memory.
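If you want to see where the memory reported by top actually goes, the JDK's own tools on the remote machine give numbers that are easier to compare; a sketch, assuming $PID is the Spark process ID and that native memory tracking was enabled at startup:

    # heap capacity vs. actually used space, per generation
    jstat -gc $PID

    # breakdown of native (off-heap) memory; requires the JVM to have been
    # started with -XX:NativeMemoryTracking=summary
    jcmd $PID VM.native_memory summary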
I'm setting up a test infrastructure using Azure & Docker - a Selenium Hub and Chrome images.
I am running the latest version of Ubuntu on Azure.
System configuration:
Ubuntu 16.*
Docker: latest version
RAM: 6 GB
SSD: 120 GB
I am able to run automation scripts in the Chrome containers without any issue if the number of containers is <= 10.
When I scale up the number of containers, the entire system freezes, stops responding, and the tests do not run.
PS: I'm also mounting /dev/shm:/dev/shm when creating the containers.
What should be the optimal system configuration to run a minimum of 75 containers?
6 GB of RAM for 75 containers means roughly 80 MB per container, and inside each of those you want to run Firefox/Chrome? They may be running headless/without a display, but that doesn't mean they are not memory hungry.
You would need to set aside about 500 MB of memory per container for such nodes. You can set a memory limit, but as soon as a container goes above it, poof - the container is dead, and so are your browser and your test. The best approach is either to use Docker Swarm to deploy a self-healing Selenium Grid,
or to use https://github.com/zalando/zalenium as mentioned by @Leo Galluci.
PS: I wrote an article on how to set up a Grid on Swarm: http://tarunlalwani.com/post/deploy-selenium-grid-using-docker-swarm-aws/. You can have a look at it to get an idea of how to scale your grid horizontally.
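As a rough sizing check, 75 nodes at ~500 MB each is already around 37 GB of RAM before /dev/shm is counted, so explicit per-container limits are worth declaring. A minimal sketch of a node service in a version 3 compose file for Swarm, mirroring the /dev/shm mount from the question (the service name and the limit values are just examples):

    # deployed with `docker stack deploy`; the deploy section is ignored by plain docker-compose up
    chrome-node:
      image: selenium/node-chrome
      volumes:
        - /dev/shm:/dev/shm
      deploy:
        replicas: 75
        resources:
          limits:
            memory: 1g
          reservations:
            memory: 500m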
I'm running some websites on a dedicated Ubuntu web server. If I'm remembering correctly, it has 8 cores, 16 GB of memory, and runs 64-bit Ubuntu. Content and files are delivered quickly to web browsers. Everything seems like a dream... until I run gzip or zip to back up an 8.6 GB website.
When running gzip or zip, Apache stops delivering content. Internal server error messages are returned until the compression process is complete. During the process, I can log in via SSH without delays and run the top command. I can see that the zip process is taking about 50% CPU (I'm guessing that's 50% of a single CPU, not all 8?).
At first I thought this could be a log issue, with Apache logs growing out of control and not wanting to be messed with. Log files are under 5MB though and being rotated when they hit 5MB. Another current thought is that Apache only wants to run on one CPU and lets any other process take the lead. Not sure where to look to address that yet.
Any thoughts on how to troubleshoot this issue? Taking all my sites out while backups occur is not an option, and I can't seem to reproduce this issue on my local machines (granted, it's different hardware and configuration). My hope is that this question is not too vague. I'm happy to provide additional details as needed.
Thanks for your brains in advance!
I'd suggest running your backup script under the "ionice" command. It will help keep httpd from being starved of I/O.
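A sketch of what that could look like, combining ionice with nice so the backup yields both disk and CPU to Apache (the archive and site paths are just placeholders):

    # idle I/O class (-c3) plus lowest CPU priority, so httpd keeps getting served first
    ionice -c3 nice -n 19 tar -czf /backups/site-$(date +%F).tar.gz /var/www/mysite

If the idle class makes the backup too slow, the best-effort class at low priority (ionice -c2 -n7) is a reasonable middle ground.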