YARN Architecture of Hadoop 2.0 - hadoop-yarn

From the Apache Hadoop documentation linked below, I learned that
the ApplicationMaster has the responsibility of negotiating appropriate
resource containers from the Scheduler (ResourceManager)
and also that
the ApplicationsManager is responsible for negotiating the first container for executing the
ApplicationMaster.
Link : http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
So here is my confusion.
If the ApplicationMaster is responsible for requesting containers from the ResourceManager, then who creates the first container, and what is the process for creating that first container to execute the ApplicationMaster?
Does someone send a request to create the first container?
What are the responsibilities of the first container? Does it only execute the ApplicationMaster, or does it also behave like any other resource container?
Please let me know if anyone has any idea about this.

First of all, you are confusing the terms ApplicationManager and ApplicationMaster. They are not the same; have a look at my answer on the difference between the Application Manager and the Application Master in YARN.
Answers to your questions are given below:
YarnClient is responsible for submitting the application to the ResourceManager. It sends an ApplicationSubmissionContext object, which carries all the information the ResourceManager needs to launch the ApplicationMaster for the application.
Yes, YarnClient does that!
The first container runs the ApplicationMaster. Its job is to request resources (containers) from the ResourceManager and to make application-level decisions. If the ResourceManager provides a sufficient number of containers (as defined by the logic in your ApplicationMaster), the ApplicationMaster can go ahead and launch the application code in them. Furthermore, the ApplicationMaster keeps track of failed containers and either relaunches them or terminates the application (killing all other containers), again based on the logic of your ApplicationMaster.
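The order of events described above can be modelled as a toy simulation. This is a minimal sketch in plain Python, not the YARN client API: the class and method names (submit_application, request_container, run) are invented for illustration, and scheduling is reduced to counting free containers.

```python
class ResourceManager:
    """Toy stand-in for the YARN ResourceManager (hypothetical, not the real API)."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.next_id = 1
        self.log = []  # (container id, purpose) in allocation order

    def _allocate(self, purpose):
        if self.free == 0:
            return None
        self.free -= 1
        container_id = self.next_id
        self.next_id += 1
        self.log.append((container_id, purpose))
        return container_id

    def submit_application(self, worker_count):
        """Called by the client: the RM allocates the first container itself
        and starts the ApplicationMaster in it."""
        am_container = self._allocate("ApplicationMaster")
        if am_container is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        return ApplicationMaster(self, am_container, worker_count)

    def request_container(self):
        """Called by a running ApplicationMaster for every further container."""
        return self._allocate("task")


class ApplicationMaster:
    """Toy ApplicationMaster that asks the RM for its task containers."""

    def __init__(self, rm, container_id, worker_count):
        self.rm = rm
        self.container_id = container_id
        self.worker_count = worker_count

    def run(self):
        """Request one container per worker and return the ids that were granted."""
        granted = []
        for _ in range(self.worker_count):
            container_id = self.rm.request_container()
            if container_id is not None:
                granted.append(container_id)
        return granted


rm = ResourceManager(total_containers=4)
am = rm.submit_application(worker_count=3)  # the client's only step
tasks = am.run()                            # everything else is the AM's job
print(rm.log)
```

The point of the model is only the ordering: the first allocation is triggered by the client's submission and is made by the ResourceManager, and every later one is requested by the ApplicationMaster.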
To understand the internals of Hadoop YARN, I suggest reading the YARN paper or, if you have more time, a book on Hadoop YARN.

Related

Memory leak related to OptionsMonitor<LoggerFilterOptions> and Serilog

I have what looks to be a memory leak in a production web service related to OptionsMonitor and Serilog. Over a period of about 10 days the memory usage gradually increases from a starting point of ~200MB up to ~1GB, at which point it hits the Kubernetes limits and out-of-memory exceptions start being thrown. K8s then restarts the pod because it is seen as unhealthy.
asp.net core 3.1 alpine docker image
Serilog 2.9.0
Microsoft.Extensions.Logging 3.1.3
Running on Amazon AWS EKS Kubernetes 1.20
I got some dumps from the production container and was able to analyse them with the JetBrains dotMemory tool. It points to OptionsMonitor as having the largest retained size. There are also ~300k instances of Serilog.Core.Logger retained in memory, attached to the OptionsMonitor onChange event. I think this event fires when the configuration updates, although nothing explicitly updates the configuration files, so I am not sure why it keeps firing or why these objects are not being released from memory. Here are a couple of screenshots from dotMemory: the overview, and a retention graph that shows Serilog.Core.Logger attached to OptionsMonitor. Any ideas what is going on here?
[Screenshot: dotMemory overview] [Screenshot: dotMemory retention graph]
For anyone who comes across this, I eventually figured it out myself. The Serilog integration being used was not the standard one, which replaces the default LoggerFactory with the SerilogLoggerFactory. The standard LoggerFactory was still in use, and it contains code that registers for changes to the config files and to the log filtering options. That seems to cause an issue when only the LoggerProvider is replaced with the SerilogLoggerProvider: something is not getting cleaned up. The SerilogLoggerFactory does not register for changes to LoggerFilterOptions, so once I switched to the standard Serilog integration the memory issue went away.

AWS EKS node group migration stopped sending logs to Kibana

I've run into a problem while using EKS with Fluent Bit, and I would be grateful for the community's help. First I'll describe the cluster.
We are running an EKS cluster in a VPC that had an unmanaged node group.
The EKS cluster network configuration is marked as "public and private", and
using fluent-bit with the Elasticsearch service we show logs in Kibana.
We decided to move to a managed node group in that cluster and successfully migrated from the unmanaged node group to a managed one.
Since the migration we cannot see any logs in Kibana. When I fetch the logs manually from the fluent-bit pods, there are no errors.
I enabled debug-level logging for fluent-bit to get a better look.
I can see that fluent-bit gathers all the log files, and then I see these messages:
[debug] [out_es] HTTP Status=403 URI=/_bulk
[debug] [retry] re-using retry for task_id=63 attemps=3
[debug] [sched] retry=0x7ff56260a8e8 63 in 321 seconds
Furthermore, we have managed node groups in other EKS clusters, but those were not migrated; they were created with managed node groups from the start.
The new managed node group was created from the same template as a working managed node group; the only difference is the compute power.
The template has nothing special in it except autoscaling.
I compared the IAM role of a working node group with that of my non-working node group, and the roles seem to be the same.
As for my fluent-bit configuration, I use the same configuration in several EKS clusters and it works, so I don't think it is the root cause, but I can add it if requested.
Has anyone had this kind of problem? Why would a node group migration cause such an issue?
Thanks in advance!
Lesson learned: always check the access policy of the resource you are having an issue with; it may not match your node group role.
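A check like the one described can be scripted. The following is a minimal sketch, assuming the resource's access policy is available as a JSON document in the usual IAM layout; the account ID and role names are hypothetical, and the check ignores Deny statements, conditions and identity-based policies, so a match here does not prove that access is granted.

```python
import json


def allowed_principals(policy_json):
    """Collect the AWS principals named in the Allow statements of a policy."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    principals = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if principal == "*":
            principals.add("*")
            continue
        aws = principal.get("AWS", [])
        # "AWS" may hold a single ARN string or a list of ARNs.
        principals.update([aws] if isinstance(aws, str) else aws)
    return principals


def role_is_allowed(policy_json, role_arn):
    """True if the role (or a wildcard) appears as a principal in an Allow statement."""
    found = allowed_principals(policy_json)
    return "*" in found or role_arn in found


# Hypothetical policy that still names only the old node group role.
policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/old-unmanaged-nodes"},
        "Action": "es:*",
        "Resource": "*",
    }],
})

old_role = "arn:aws:iam::111122223333:role/old-unmanaged-nodes"
new_role = "arn:aws:iam::111122223333:role/new-managed-nodes"
print(role_is_allowed(policy, old_role), role_is_allowed(policy, new_role))
```

If the role used by the new node group is missing from the policy, that would be consistent with the HTTP 403 on /_bulk seen in the debug log.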

IBM APIConnect - task security-appID

I have an instance of APIConnect on premise.
Analyzing the logs, I have seen the task called "security-appID" moving from 10ms execution time to 200ms execution time.
What is the meaning of this task?
This task I believe offloads application security requests to other integrations if you have it so configured. It does not have anything to do necessarily with apiconnect, it is probably related to your bluemix ID, dashboard or landing page and how that is setup. You can probably find more information about it in the BMX docs. https://console.dys0.bluemix.net/docs/services/appid/existing.html#adding-app-id-to-an-existing-app

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in local virtual machine another in remote cloud. Both clusters in Standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to achieve this:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark Master URL to spark://local1:7077 or spark://remote1:7077, and then run the code from IntelliJ IDEA. That is, I don't want to use spark-submit to submit the job.
But I ran into a problem:
When I use the local cluster, everything goes well. Both running the code in IntelliJ IDEA and using spark-submit submit the job to the cluster, and the job finishes.
But when I use the remote cluster, I get this warning in the log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Note that it says sufficient resources, not sufficient memory!
The message keeps printing and nothing else happens. Both spark-submit and running the code in IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to a remote cluster?
If so, does it need any configuration?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my situation is different. When I run my code in IntelliJ IDEA with the Spark Master set to the local virtual machine cluster, it works. With the remote cluster, however, I get the Initial job has not accepted any resources;... warning instead.
I want to know whether the security policy or a firewall can cause this.
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the very least, there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads in the Spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
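The comparison can also be done by hand. Here is a minimal sketch, assuming you have copied each worker's free cores and memory from the master UI; the worker names and capacities are made up, and the rule used (an executor needs a single worker with enough free memory and enough free cores) is a simplification of what the standalone scheduler does.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Free capacity of one standalone worker, as read from the master UI."""
    name: str
    free_cores: int
    free_memory_mb: int


def parse_memory_mb(value):
    """Convert a Spark-style size such as '8g' or '512m' to MiB."""
    units = {"m": 1, "g": 1024}
    value = value.strip().lower()
    return int(value[:-1]) * units[value[-1]]


def workers_that_fit(workers, executor_memory, executor_cores):
    """Return the names of workers able to host one executor of the given size."""
    needed_mb = parse_memory_mb(executor_memory)
    return [
        w.name
        for w in workers
        if w.free_memory_mb >= needed_mb and w.free_cores >= executor_cores
    ]


# Hypothetical capacities; replace them with the values shown on port 8080.
cluster = [Worker("worker-1", 4, 6144), Worker("worker-2", 8, 16384)]

print(workers_that_fit(cluster, "8g", 8))   # spark.executor.memory 8g, spark.executor.cores 8
print(workers_that_fit(cluster, "32g", 8))  # a request no worker can satisfy
```

An empty result for your own settings would match the "Initial job has not accepted any resources" warning: the workers are registered, but none of them can host the executor you asked for.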

In YARN what is the difference between a managed and an unmanaged Application Manager

I'm experimenting with the Distributed Shell example in YARN 2.2 and am hoping that someone can clarify the difference between a managed and an unmanaged ApplicationMaster (AM).
For example the following lines appear in the client code
// unmanaged AM
appContext.setUnmanagedAM(true);
but I am unable to find documentation explaining the difference this line makes to the execution behaviour.
Many thanks.
setUnmanagedAM(true) is used for debugging purposes, i.e. it runs the ApplicationMaster locally instead of having the cluster launch it in a container, so it is easier to step into the code and debug.
You can see it in use in the hadoop-yarn-applications-unmanaged-am-launcher.jar that ships with YARN.
Check the respective JIRA tickets: JIRA-420 and JIRA-419 (client side)
Currently, the RM itself manages the AM by allocating a container for it and negotiating the launch on the NodeManager and manages the AM lifecycle. Thereafter, the AM negotiates resources with the RM and launches tasks to do the real work.
It would be a useful improvement to enhance this model by allowing the AM to be launched independently by the client without requiring the RM. These AM's would be launched on a gateway machine that can talk to the cluster. This would open up new use cases such as the following
1) Easy debugging of AM, specially during initial development. Having the AM launched on an arbitrary cluster node makes it hard to looks at logs or attach a debugger to the AM. If it can be launched locally then these tasks would be easier.
2) Running AM's that need special privileges that may not be available on machines managed by the NodeManager
Blog post with more implementation details on unmanaged AM: click-me
Example of how Impala manages its resources with the help of unmanaged applications: Llama