Ignite on YARN - IGNITE_RUN_CPU_PER_NODE not obeyed

When I start the Ignite-on-YARN application with one CPU per node, it works as expected and launches containers. However, when I try to start it with 8 cores per node, the containers in YARN launch and get killed immediately, leaving only the ApplicationMaster (AM) running.
These settings work:
IGNITE_RUN_CPU_PER_NODE=1
IGNITE_MEMORY_PER_NODE=122880
IGNITE_NODE_COUNT=3
These settings won't work:
IGNITE_RUN_CPU_PER_NODE=8
IGNITE_MEMORY_PER_NODE=122880
IGNITE_NODE_COUNT=3
I have also tried other numbers of CPUs per node, but it only works with 1.
How can I make this work? There are no logs for the containers that were launched and killed; I can only see from the YARN ResourceManager that they were allocated and released.
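One thing worth checking in a case like this (a guess, not a confirmed cause) is how the YARN per-container vcore limits compare with 8. A quick way to inspect them, assuming the usual Hadoop config location:
# Assumed config path; adjust for your distribution
grep -A1 'yarn.nodemanager.resource.cpu-vcores' /etc/hadoop/conf/yarn-site.xml
grep -A1 'yarn.scheduler.maximum-allocation-vcores' /etc/hadoop/conf/yarn-site.xml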

How to interrupt triggered GitLab pipelines

I'm using a webhook to trigger my GitLab pipeline. Sometimes this trigger fires a bunch of times, but only the last pipeline needs to run (static site generation). Right now, it will run as many pipelines as I have triggered. My pipeline takes 20 minutes, so sometimes it's running for the rest of the day, which is completely unnecessary.
https://docs.gitlab.com/ee/ci/yaml/#interruptible and https://docs.gitlab.com/ee/user/project/pipelines/settings.html#auto-cancel-pending-pipelines only work on pushed commits, not on triggers
A similar problem is discussed in gitlab-org/gitlab-foss issue 41560
Example of a use-case:
I want to always push the same Docker "image:tag", for example: "myapp:dev-CI". The idea is that "myapp:dev-CI" should always be the latest Docker image of the application that matches the HEAD of the develop branch.
However, if 2 commits are pushed, then 2 pipelines are triggered and executed in parallel, and the latest triggered pipeline often finishes before the oldest one.
As a consequence, the pushed Docker image is not the latest one.
Proposition:
As a workaround on *nix you can fetch the running pipelines from the API and either wait until they finish or cancel them through the same API.
In the example below, the script checks for running pipelines with lower IDs on the same branch and sleeps until they are done.
The jq package is required for this code to work.
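A sketch of such a script (the original snippet is not reproduced here), assuming it runs inside a GitLab CI job so that CI_API_V4_URL, CI_PROJECT_ID, CI_COMMIT_REF_NAME and CI_PIPELINE_ID are provided by GitLab, and that GITLAB_API_TOKEN is a CI/CD variable holding an API token allowed to read pipelines:
# Wait while any older pipeline on the same branch is still running
while true; do
  older=$(curl --silent --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
    "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/pipelines?status=running&ref=${CI_COMMIT_REF_NAME}" \
    | jq --argjson me "${CI_PIPELINE_ID}" '[.[] | select(.id < $me)] | length')
  [ "${older}" -eq 0 ] && break
  echo "Waiting for ${older} older running pipeline(s) on ${CI_COMMIT_REF_NAME}..."
  sleep 30
done
To cancel instead of waiting, the same list of IDs can be POSTed to the /projects/:id/pipelines/:pipeline_id/cancel endpoint.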
Or:
Create a new runner instance
Configure it to run jobs marked as deploy with concurrency 1
Add the deploy tag to your CD job.
It's now impossible for two deploy jobs to run concurrently.
To guard against a situation where an older pipeline may run after a new one, add a check in your deploy job to exit if the current pipeline ID is less than the current deployment.
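A minimal sketch of that guard, assuming the ID of the last deployed pipeline is stored somewhere the job can read it back (last_deployed_pipeline.txt is a hypothetical file; CI_PIPELINE_ID is provided by GitLab):
# Skip the deploy if a newer pipeline has already deployed
last_deployed=$(cat last_deployed_pipeline.txt 2>/dev/null || echo 0)
if [ "${CI_PIPELINE_ID}" -lt "${last_deployed}" ]; then
  echo "Pipeline ${last_deployed} already deployed something newer; skipping."
  exit 0
fi
# ... perform the actual deployment here ...
echo "${CI_PIPELINE_ID}" > last_deployed_pipeline.txt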
Slight modification:
For me, one slight change: I kept the global concurrency setting the same (8 runners on my machine so concurrency: 8).
But, I tagged one of the runners with deploy and added limit: 1 to its config.
I then updated my .gitlab-ci.yml to use the deploy tag in my deploy job.
Works perfectly: my code_tests job can run simultaneously on 7 runners but deploy is "single threaded" and any other deploy jobs go into pending state until that runner is freed up.

Selenium crashing in Docker due to Browsing context has been discarded

How do you run Selenium based tests inside Docker?
I'm trying to get some Python+Selenium tests, which use Firefox and Geckodriver, to run under an Ubuntu 18 Docker image.
My docker-compose.yml file is simply:
version: "3.5"
services:
app_test:
build:
context: .
shm_size: '4gb'
mem_limit: 4096MB
dockerfile: Dockerfile.test
Unfortunately, most tests are failing with errors like:
selenium.common.exceptions.NoSuchWindowException: Message: Browsing context has been discarded
The few search results I can find mentioning this error suggest it's because of low memory. The server I'm running the tests on has 8GB of total memory, although I also tested on a machine with 32GB and received the same error.
I also added a call to print the output of top before each test, and it's showing virtually no memory usage, so I'm not sure what would be causing the test to crash due to insufficient memory.
Some articles suggested adding the shm_size and mem_limit lines, but those had no effect.
I've also tried different versions of Firefox, from the recent version 71 to the older ESR releases, to rule out a bug caused by incompatible versions of Firefox+Selenium+Geckodriver. I'm otherwise following this compatibility table.
What is causing this error and how do I fix it?
The root cause could be running out of shared memory (/dev/shm).
To fix it, run the Docker container with a larger --shm-size.
Example:
--shm-size="2G"

Flink job on EMR runs only on one TaskManager

I am running EMR cluster with 3 m5.xlarge nodes (1 master, 2 core) and Flink 1.8 installed (emr-5.24.1).
On master node I start a Flink session within YARN cluster using the following command:
flink-yarn-session -s 4 -jm 12288m -tm 12288m
That is the maximum memory and slots per TaskManager that YARN let me set up based on selected instance types.
During startup there is a log:
org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=12288, taskManagerMemoryMB=12288, numberTaskManagers=1, slotsPerTaskManager=4}
This shows that there is only one task manager. Also when looking at YARN Node manager I see that there is only one container running on one of the core nodes. YARN Resource manager shows that the application is using only 50% of cluster.
With the current setup I would assume that I could run a Flink job with parallelism set to 8 (2 TaskManagers * 4 slots), but if the submitted job has parallelism set to more than 4, it fails after a while because it cannot get the desired resources.
If the job parallelism is set to 4 (or less), the job runs as it should. Looking at CPU and memory utilisation with Ganglia shows that only one node is utilised, while the other stays flat.
Why does the application run on only one node, and how can I utilise the other node as well? Do I need to configure something in YARN so that Flink is set up on the other node too?
In previous versions of Flink there was a startup option -n which was used to specify the number of task managers. That option is now obsolete.
When you're starting a 'Session Cluster', you should see only one container which is used for the Flink Job Manager. This is probably what you see in the YARN Resource Manager. Additional containers will automatically be allocated for Task Managers, once you submit a job.
How many cores do you see available in the Resource Manager UI?
Don't forget that the Job Manager also uses cores out of the available 8.
You need to do a little "Math" here.
For example, if you had set the number of slots to 2 per TM with less memory per TM and then submitted a job with parallelism of 6, it should have worked with 3 TMs.
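A sketch of that suggestion (the memory values are illustrative, not taken from the answer):
# 2 slots per TaskManager and smaller TM memory, so several TMs fit across the two core nodes
flink-yarn-session -s 2 -jm 2048m -tm 4096m -d
# A job submitted with parallelism 6 then needs 3 TaskManagers (3 TMs * 2 slots)
flink run -p 6 my-job.jar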

yarn not getting nodes

This is an AWS EMR cluster with 2 task nodes and a master.
I'm trying the hello-samza project, which launches a YARN job. The job gets stuck in the ACCEPTED state. I looked at other posts and it seems that YARN is not getting any nodes. Any help on why YARN is not getting the task nodes would be appreciated.
[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list
17/04/18 23:30:45 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn application -list -appStates ALL
17/04/18 23:26:30 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1492557889328_0001 wikipedia-parser_1 Samza hadoop default ACCEPTED UNDEFINED 0% N/A
I wrote a complete answer for a similar case I experienced: have a look at it, it might be this kind of configuration issue.
It seems like the NodeManagers are not running on either node (either not started at all or exited with an error). Use the jps command to check whether all the daemons associated with YARN are running on the two nodes. Additionally, check both NodeManager logs to see if any exception might have killed them.
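For example (the log path below is the usual EMR location, so treat it as an assumption for other setups):
# Run on each core/task node: a healthy node lists a NodeManager process
jps
# Look in the NodeManager log for the exception that killed it
less /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log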

How to get around memory error with karma & phantomjs

We're running tests using Karma and PhantomJS. Last week, our tests mysteriously started crashing PhantomJS with an error of -1073741819.
Based on this thread for Chutzpah it appears that code indicates a native memory failure with PhantomJS.
Upon further investigation, we are consistently seeing phantom crash around 750MB of memory.
Is there a way to configure Karma so that it does not run up against this limit? Or a way to tell it to flush phantom?
We only have around 1200 tests so far. We're about 1/4 of the way through our project, so 5000 UI tests doesn't seem out of the question.
Thanks to the StackOverflow phenomenon of posting a question and quickly discovering an answer, we solved this by adding gulp tasks. Before we were just running karma start at the command line. This spun up a single instance of phantomjs that crashed when 750MB was reached.
Now we have a gulp command for each one of our sections of tests, e.g. gulp common-tests and gulp admin-tests and gulp customer-tests
Then a single gulp karma command runs each of those groupings. This allows each gulp command to have its own instance of PhantomJS, and therefore stay underneath that threshold.
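The same grouping can be expressed without gulp, as separate Karma invocations with one (hypothetical) config file per slice of the suite, so each run gets a fresh PhantomJS process:
# Hypothetical config files, one per test grouping
karma start karma.common.conf.js --single-run
karma start karma.admin.conf.js --single-run
karma start karma.customer.conf.js --single-run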
We ran into a similar issue. Your approach is interesting and certainly sidesteps the issue. However, be prepared to face it again later.
I've done some investigation and found the cause of the memory growth (at least in our case). It turns out that when you use:
beforeEach(inject(function(SomeActualService) { .... }));
the memory taken up by SomeActualService does not get released at the end of the describe block, and if you have multiple test files where you inject the same service (or other injectable objects), more memory will be allocated for it again.
I have a couple of ideas on how to avoid this:
1. create mock objects and never use inject to get real objects unless you are in the test that tests that module. This will require writing tons of extra code.
2. Create your own tracker (for tests only) for injectable objects. That way they can be loaded only once and reused between test files.
Forgot to mention: We are using angular 1.3.2, Jasmine 2.0 and hit this problem around 1000 tests.
I was also running into this issue after about 1037 tests on Windows 10 with PhantomJS 1.9.18.
It would appear as ERROR [launcher]: PhantomJS crashed. after the process's RAM exceeded about 800-850 MB.
There appears to be a temporary fix here:
https://github.com/gskachkov/karma-phantomjs2-launcher
https://www.npmjs.com/package/karma-phantomjs2-launcher
You install it via npm install karma-phantomjs2-launcher --save-dev
But you then need to use it in karma.conf.js via
config.set({
browsers: ['PhantomJS2'],
...
});
This seems to run the same set of tests while only using between 250-550 MB RAM and without crashing.
Note that this fix works out of the box on Windows and OS X, but not Linux (PhantomJS2 binaries won't start). This affects pushes to Travis CI.
To work around this issue on Debian/Ubuntu:
sudo apt-get install libicu52 libjpeg8 libfontconfig libwebp5
This is a problem with PhantomJS. According to another source, PhantomJS only runs the garbage collector when the page is closed, and this only happens after your tests run. Other browsers work fine because their garbage collectors work as expected.
After spending a few days on the issue, we concluded that the best solution was to split tests into groups. We had grunt create a profile for each directory dynamically and created a command that runs all those profiles. For all intents and purposes, it works just the same.
We had a similar issue on Linux (Ubuntu) that turned out to be caused by the maximum number of memory map areas the process can use:
$ cat /proc/sys/vm/max_map_count
65530
Then run this:
$ sudo bash -c 'echo 6553000 > /proc/sys/vm/max_map_count'
Note the number was multiplied by 100.
This changes the setting for the current session only. If it solves the problem, you can make it permanent for all future sessions:
$ sudo bash -c 'echo vm.max_map_count = 6553000 > /etc/sysctl.d/60-max_map_count.conf'
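To confirm the value currently in effect (it should print vm.max_map_count = 6553000 after the change):
$ sysctl vm.max_map_count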
Responding to an old question, but hopefully this helps ...
I have a build process which a CI job runs on a command-line-only Linux box, so it seems that PhantomJS is my only option there. I have experienced this memory issue locally on my Mac, but somehow it doesn't happen on the Linux box. My solution was to add another test command to my package.json that runs Karma using Chrome, and run that locally for my tests. When pushed up, Jenkins kicks off the regular test command, running PhantomJS.
Install this plugin: https://github.com/karma-runner/karma-chrome-launcher
Add this to the "scripts" section of package.json:
"test": "karma start",
"test:chrome": "karma start --browsers Chrome"