YARN Reserved Memory Issue - hadoop-yarn

When using the FIFO scheduler with YARN (FIFO is the default, right?), I found that YARN reserves some memory/CPU to run the application. Our application doesn't need to reserve any of this, since we want a fixed number of cores for the tasks depending on the user's account. This reserved memory makes our calculations inaccurate, so I am wondering if there is any way to solve this. If removing it is not possible, we would try scaling the cluster (we are using Dataproc on GCP), but without graceful decommissioning, scaling down the cluster shuts down the running job.
1. Is there any way to get rid of the reserved memory?
2. If not, is there any way to implement graceful decommissioning on YARN 2.8.1? I found examples for 3.0.0 alpha (GCP only has the beta version), but couldn't find any working instructions for 2.8.1.
Thanks in advance!

Regarding 2, Dataproc supports YARN graceful decommissioning because Dataproc 1.2 uses Hadoop 2.8.
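For what it's worth, here is a minimal sketch of how such a graceful scale-down could be scripted in Python by shelling out to the gcloud CLI. The cluster name, region, target worker count, and timeout are hypothetical, and on older Cloud SDK releases the command may only be available as gcloud beta dataproc clusters update.

import subprocess

# Hypothetical cluster/region/size/timeout; assumes the Cloud SDK is
# installed and authenticated on the machine running this script.
subprocess.run(
    [
        "gcloud", "dataproc", "clusters", "update", "my-cluster",
        "--region", "us-central1",
        "--num-workers", "2",                      # cluster size after the scale-down
        "--graceful-decommission-timeout", "1h",   # let YARN drain running containers first
    ],
    check=True,
)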

Google cloud GPU machines reboot abruptly

When training a model on a GPU machine, the training gets interrupted by some system patch process. Since Google Cloud GPU machines do not have a live-migration option, it is a painful task to restart the training every time this happens. Google has clearly stated in this doc that there is no way around this other than restarting the machines.
Is there a clever way to detect that the machine has been rebooted and resume the training automatically?
Sometimes it also happens that, due to a kernel update, the CUDA drivers stop working, the GPU is no longer visible, and the drivers need to be reinstalled. So writing a startup script to resume the training is not a bulletproof solution either.
Yes, there is. If you use TensorFlow, you can use its checkpointing feature to save your progress and pick up where you left off.
One great example of this is provided here: https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/gce/survival-training/README-tf-estimator.md
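As a rough illustration of the idea (this is not the code from that guide; the bucket path and the toy model below are made up), a tf.estimator.Estimator whose model_dir points at GCS will automatically resume from its latest checkpoint when the training script is re-run after a reboot:

import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy linear model standing in for the real training graph.
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op, predictions=preds)

def input_fn():
    # Dummy data; replace with the real input pipeline.
    x = tf.random_normal([32, 4])
    y = tf.reduce_sum(x, axis=1, keepdims=True)
    return {"x": x}, y

config = tf.estimator.RunConfig(
    model_dir="gs://my-bucket/training-run-1",  # checkpoints live on GCS, so they survive the VM
    save_checkpoints_steps=500,
    keep_checkpoint_max=5,
)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

# If the machine reboots and a startup script re-runs this file, train()
# restores the latest checkpoint from model_dir and continues from there
# instead of starting over at step 0.
estimator.train(input_fn=input_fn, max_steps=100000)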

Ambari agent memory leak I can't fix

Ambari 2.1.2, HDP 2.3.2.0-2950
I noticed too much resident memory being used by the agents on a cluster that has been running for a few days.
I found a proposed solution:
https://community.hortonworks.com/questions/21253/ambari-agent-memory-leak-or-taking-too-much-memory.html
https://issues.apache.org/jira/browse/AMBARI-17539
I modified the code in main.py, but the agent still has the memory leak. The following is the code I added:
main.py: https://community.hortonworks.com/storage/attachments/34791-mainpy.txt
If I remember correctly, in older Ambari versions there was a memory leak because a new log formatter was created every time logging was invoked.
Anyway, the best solution would be to upgrade to a newer Ambari version; stack 2.3 is widely supported.
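To illustrate the kind of leak being described (a minimal sketch, not the actual agent code): if a formatter and handler are created on every logging call, the handlers accumulate on the logger object and memory grows steadily, whereas configuring the logger once avoids it.

import logging

def log_leaky(msg):
    # Anti-pattern: a new handler/formatter per call. addHandler() keeps
    # every one of them alive on the logger, so memory keeps growing.
    logger = logging.getLogger("leaky_agent")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.info(msg)

# Fix: configure the logger once at import time and reuse it everywhere.
logger = logging.getLogger("fixed_agent")
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

def log_fixed(msg):
    logger.info(msg)  # no new handler or formatter objects per call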

Openstack recover orphaned instances

I'm using OpenStack Havana with one KVM-based compute node and a controller node running in a VM.
After a bad hardware failure, I ended up in a situation where the controller is only aware of a subset of the instances (those preceding a certain date) and has completely lost the newer ones. I suppose we had to restore the controller from an older backup.
All the information about the instances is still available on the compute node (disk, XML), and they even still appear in virsh list --all.
Is there a way to just re-import them into the controller? Maybe via SQL or some nova command line?
Thanks.
OK, we solved the issue the rough way. We converted the disk files produced for the OpenStack instances to VDI (thanks, qemu-img), then ran the appropriate glance command to import each VDI as an image into OpenStack. From the dashboard we then created an instance from that image and reassigned our floating IP.
Does anyone see any contraindications to this approach?
Thanks.
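For anyone repeating this, here is a rough sketch of the same recovery scripted in Python. The disk path and image name are hypothetical (on a KVM compute node the instance disks usually sit under /var/lib/nova/instances/<uuid>/), and the glance CLI is assumed to be configured with the usual OS_* environment variables.

import subprocess

disk_path = "/var/lib/nova/instances/<uuid>/disk"  # qcow2 disk left behind on the compute node
vdi_path = "/tmp/recovered-instance.vdi"

# Convert the libvirt/KVM disk to VDI (this also flattens any backing file).
subprocess.run(["qemu-img", "convert", "-O", "vdi", disk_path, vdi_path], check=True)

# Import the converted disk into Glance; a new instance can then be booted
# from this image via the dashboard or the nova CLI and the floating IP reattached.
subprocess.run(
    [
        "glance", "image-create",
        "--name", "recovered-instance",
        "--disk-format", "vdi",
        "--container-format", "bare",
        "--file", vdi_path,
    ],
    check=True,
)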

IntelliJ IDEA Linux Mint Problems

When I run IntelliJ IDEA on Linux Mint, I get a warning in the terminal:
/IDEA/idea-IU-141.178.9/bin $ ./idea.sh
[ 47358] WARN - om.intellij.util.ProfilingUtil - Profiling agent is not enabled. Add -agentlib:yjpagent to idea.vmoptions if necessary to profile IDEA.
[ 63287] WARN - .ExternalResourceManagerExImpl - Cannot find standard resource. filename:/META-INF/tapestry_5_3.xsd class=class com.intellij.javaee.ResourceRegistrarImpl, classLoader:null
I'm using 64-bit Java 8. I think this error is leading to some CSS loading problem.
Does anyone know what's going on with this?
It's not an error, it's a warning.
It just means the built-in profiler is not enabled; the profiler gives you diagnostics like CPU and memory usage, which are useful when IntelliJ becomes unresponsive or sluggish.
Don't worry about it; unless you run into a lot of startup pain, it is nothing critical. If you do want the profiler enabled, you can follow the instructions here to add the appropriate runtime flag (the -agentlib:yjpagent option the warning mentions) to idea.vmoptions.
Does Mint use the Oracle JDK or OpenJDK? IntelliJ recommends the Oracle JDK for IDEA; switching to it fixed at least one problem I had under Fedora (at the cost of some disk space).

Does cloudera distribution of hadoop use the control scripts?

Are the control scripts (start-dfs.sh and start-mapred.sh) used by CDH to start daemons on the fully distributed cluster?
I downloaded and installed CDH 5 but could not find the control scripts in the installation, so I am wondering how CDH starts the daemons on the slave nodes.
Or, since the daemons are installed as services, do they simply start at system start-up, so that there is no need for control scripts in CDH, unlike Apache Hadoop?
Not as such, no. If I recall correctly, Cloudera Manager invokes the daemons using Supervisor (http://supervisord.org/), so you manage the services in CM. CM itself runs an agent process as a service on each node, and you can find its start script in /etc/init.d. There is no need for you to install or start or stop anything yourself; you install, deploy configuration, control, and monitor services in CM.
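As a small illustration (a sketch, assuming a node managed by Cloudera Manager): the only init.d service you normally touch on a worker node is the CM agent itself; the Hadoop daemons are children of the agent, launched through supervisord, and are started and stopped from the CM UI rather than from the shell.

import subprocess

# The Hadoop daemons have no init.d entries of their own on a CM-managed
# node; checking the CM agent is the only service-level command you need.
result = subprocess.run(["service", "cloudera-scm-agent", "status"])
print("cloudera-scm-agent is running" if result.returncode == 0 else "cloudera-scm-agent is not running")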