Is mapred-site.xml totally useless in Hadoop 2 with YARN?

I have read on many Hadoop-related pages that in Hadoop 2 a lot of mapred-site.xml configurations no longer work and that yarn-site.xml is used instead. If that is the case, why is mapred-site.xml still maintained?

For Hadoop v2, mapred-site.xml is still in use. There is a new property named mapreduce.framework.name; its value is the runtime framework for executing MapReduce jobs and can be one of local, classic or yarn. With local, your mapper and reducer processes are executed in the same JVM, normally for debugging purposes (only one reducer can be launched with the local runtime). With classic, the traditional MR v1 runtime is used, and with yarn, jobs run on MR v2.
When you use the YARN runtime for MapReduce jobs, your mapper and reducer processes are spawned in containers, and mapred-site.xml can be used to specify system resources (e.g. memory and vCPUs) for the MapReduce containers. Remember that the upper limit of memory and vCPUs for containers is set in yarn-site.xml. Have a look at the following properties in mapred-site.xml (a sample snippet follows the list):
mapreduce.map.memory.mb
mapreduce.map.cpu.vcores
mapreduce.reduce.memory.mb
mapreduce.reduce.cpu.vcores
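For example, a minimal mapred-site.xml sketch with these properties (the values below are illustrative assumptions, not recommendations for your cluster):
<configuration>
  <!-- illustrative values only; size these for your workload -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
  </property>
</configuration>
Whatever you request here must fit within the container limits defined in yarn-site.xml (yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores), otherwise the request will not be granted as-is.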

mapred-site.xml is still in use. Look here for all default values.
Look here for new properties in Hadoop V2 as a replacement for deprecated properties of Hadoop V1.
A few examples:
Hadoop 0.x, 1.x (deprecated)          Hadoop 2.x
-----------------------------------   -----------------------------------
mapred.map.child.env                  mapreduce.map.env
mapred.map.child.java.opts            mapreduce.map.java.opts
mapred.map.child.log.level            mapreduce.map.log.level
mapred.map.max.attempts               mapreduce.map.maxattempts
mapred.map.output.compression.codec   mapreduce.map.output.compress.codec
mapred.mapoutput.key.class            mapreduce.map.output.key.class
mapred.mapoutput.value.class          mapreduce.map.output.value.class
mapred.mapper.regex.group             mapreduce.mapper.regexmapper.group
mapred.mapper.regex                   mapreduce.mapper.regex
mapred.map.task.debug.script          mapreduce.map.debug.script
mapred.map.tasks                      mapreduce.job.maps


Apache Flink to use S3 for backend state and checkpoints

Background
I was planning to use S3 to store Flink's checkpoints using the FsStateBackend, but I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: I am using Flink 1.10.0.
I have found the solution for the above issue, so I am listing the required steps here.
Steps
We need to add some configuration to the flink-conf.yaml file, as listed below.
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
s3.access-key: XXXXXXXXXXXXXXXXXXX #your-access-key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #your-secret-key
s3.endpoint: http://127.0.0.1:9000 #your-endpoint-hostname (I have used Minio)
After completing the first step, we need to copy the respective JAR files (flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) from the opt directory to the plugins directory of your Flink distribution.
E.g.:
1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar (recommended for the StreamingFileSink)
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar (recommended for checkpointing)
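The equivalent shell commands, assuming Flink is installed under /flink-1.10.0 (adjust the path to your installation); note that each plugin must sit in its own subdirectory under plugins:
mkdir -p /flink-1.10.0/plugins/s3-fs-hadoop /flink-1.10.0/plugins/s3-fs-presto
cp /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar /flink-1.10.0/plugins/s3-fs-hadoop/
cp /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar /flink-1.10.0/plugins/s3-fs-presto/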
Add this to your checkpointing code:
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
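For context, here is a minimal sketch of how that line fits into a Flink 1.10 job. The class name, checkpoint interval and dummy source below are illustrative assumptions, not part of the original setup:
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a checkpoint every 60 seconds (illustrative interval)
        env.enableCheckpointing(60_000);
        // Write checkpoint data to the S3 bucket configured in flink-conf.yaml
        env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"));
        // Dummy pipeline so the sketch runs; replace with your real sources and sinks
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-to-s3-example");
    }
}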
After completing all the above steps, restart Flink if it is already running.
Note:
If you are using both plugins (flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink, then please use s3p:// specifically for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop instead of s3://.
For more details click here.

Increase memory allocated to an application deployed to Payara Micro

I am running my application from a Payara Micro uber JAR and would like to increase the memory allocated to the application. How can I do this at the point of creating the uber JAR?
There are a couple of ways you can do this. The first way I'll mention is the preferred way:
1. Use asadmin commands
The latest edition of Payara Micro introduces an option called --postbootcommandfile which allows you to run asadmin commands against Payara Micro. Your file should include something like this:
delete-jvm-options -Xmx512m
create-jvm-options -Xmx1g
create-jvm-options -Xms1g
You will need to make sure you delete the existing options before applying new ones.
You can then use the file similar to this:
java -jar payara-micro.jar --postbootcommandfile myCommands.txt --deploy myApp.war --outputuberjar myPayaraMicroApp.jar
Your settings should now persist in the resulting Uber JAR.
2. Supply a custom domain.xml
The alternative would be to supply a domain.xml of your own and override the built-in domain.xml with it.
You can use the --rootdir option to get Payara Micro to output its configuration to a directory so you can make changes there. This process is outlined in this blog:
http://blog.payara.fish/working-with-external-configuration-files-in-payara-micro
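If you go down this route, the heap settings live as <jvm-options> entries inside the <java-config> element of domain.xml. A heavily abridged fragment with illustrative values (keep the other existing jvm-options intact):
<java-config>
  <!-- illustrative heap settings only; the real file contains many more options -->
  <jvm-options>-Xmx1g</jvm-options>
  <jvm-options>-Xms1g</jvm-options>
</java-config>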
If you already have a custom domain.xml to hand, you can use the --domainconfig property to supply it, as follows:
java -jar payara-micro.jar --domainconfig myCustomDomain.xml --deploy myApp.war --outputuberjar myPayaraMicroApp.jar
After following either of these methods, you can simply start the resulting JAR and all the settings and configuration will be applied:
java -jar myPayaraMicroApp.jar
The Payara Micro uber JAR is a plain JAR and doesn't start a new JVM the way Payara Server does, so there's no way to modify JVM memory settings from within the JAR: the JVM has already started by the time the JAR is read. Although it's possible to add the JVM settings to the Payara Micro configuration, they are ignored and not applied. Those configuration values are only used within Payara Server.
With Payara Micro uber JAR, you need to specify the JVM options on the command line, like this:
java -Xmx1g -Xms1g -jar myPayaraMicroApp.jar
If you need to specify JVM arguments in the uber JAR, you need to use a solution like capsule.io to wrap the JAR into a launcher JAR that would spawn a separate JVM for Payara Micro and pass the arguments to it.

What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?

I would like to know the relation between the mapreduce.map.memory.mb and mapred.map.child.java.opts parameters.
Should mapreduce.map.memory.mb be greater than mapred.map.child.java.opts?
mapreduce.map.memory.mb is the upper memory limit that Hadoop allows to be allocated to a mapper, in megabytes. The default is 512.
If this limit is exceeded, Hadoop will kill the mapper with an error like this:
Container[pid=container_1406552545451_0009_01_000002,containerID=container_234132_0001_01_000001]
is running beyond physical memory limits. Current usage: 569.1 MB of
512 MB physical memory used; 970.1 MB of 1.0 GB virtual memory used.
Killing container.
A Hadoop mapper is a Java process, and each Java process has its own maximum heap allocation, configured via mapred.map.child.java.opts (or mapreduce.map.java.opts in Hadoop 2+).
If the mapper process runs out of heap memory, it throws a Java out-of-memory exception:
Error: java.lang.RuntimeException: java.lang.OutOfMemoryError
Thus, the Hadoop and Java settings are related. The Hadoop setting is more of a resource enforcement/control mechanism, while the Java setting is a resource configuration.
The Java heap settings should be smaller than the Hadoop container memory limit, because we need to reserve memory for the rest of the Java process (code, stack, and so on). Usually it is recommended to reserve about 20% of the memory for this overhead. So if the settings are correct, Java-based Hadoop tasks should never get killed by Hadoop, and you should never see the "Killing container" error shown above.
If you experience Java out of memory errors, you have to increase both memory settings.
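As a hedged illustration of the relationship, here is a mapred-site.xml fragment with illustrative values, keeping -Xmx at roughly 80% of the container size:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- container limit enforced by Hadoop/YARN -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value> <!-- JVM heap, roughly 80% of the container -->
</property>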
The following properties let you specify options to be passed to the JVMs running your tasks. These can include -Xmx to control the available heap.
Hadoop 0.x, 1.x (deprecated)      Hadoop 2.x
-------------------------------   --------------------------
mapred.child.java.opts
mapred.map.child.java.opts        mapreduce.map.java.opts
mapred.reduce.child.java.opts     mapreduce.reduce.java.opts
Note there is no direct Hadoop 2 equivalent for the first of these; the advice in the source code is to use the other two. mapred.child.java.opts is still supported (but is overridden by the other two more-specific settings if present).
Complementary to these, the following let you limit total memory (possibly virtual) available for your tasks - including heap, stack and class definitions:
Hadoop 0.x, 1.x (deprecated)      Hadoop 2.x
-------------------------------   --------------------------
mapred.job.map.memory.mb          mapreduce.map.memory.mb
mapred.job.reduce.memory.mb       mapreduce.reduce.memory.mb
I suggest setting -Xmx to 75% of the memory.mb values.
In a YARN cluster, jobs must not use more memory than the server-side config yarn.scheduler.maximum-allocation-mb or they will be killed.
To check the defaults and precedence of these, see JobConf and MRJobConfig in the Hadoop source code.
Troubleshooting
Remember that your mapred-site.xml may provide defaults for these settings. This can be confusing; e.g. if your job sets mapred.child.java.opts programmatically, it would have no effect if mapred-site.xml sets mapreduce.map.java.opts or mapreduce.reduce.java.opts, since those more specific settings take precedence. You would need to set those properties in your job instead, to override mapred-site.xml. Check your job's configuration page (search for 'xmx') to see what values have been applied and where they have come from.
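As a hedged sketch of setting the Hadoop 2 names per job from driver code (the values and job name are illustrative assumptions; Configuration and Job are the standard org.apache.hadoop classes):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.set("mapreduce.map.memory.mb", "2048");      // container size for map tasks
conf.set("mapreduce.map.java.opts", "-Xmx1536m"); // JVM heap for map tasks
Job job = Job.getInstance(conf, "my-job");        // hypothetical job name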
ApplicationMaster memory
In a YARN cluster, you can use the following two properties to control the amount of memory available to your ApplicationMaster (to hold details of input splits, status of tasks, etc):
Hadoop 0.x, 1.x                   Hadoop 2.x
-------------------------------   ----------------------------------
                                  yarn.app.mapreduce.am.command-opts
                                  yarn.app.mapreduce.am.resource.mb
Again, you could set -Xmx (in the former) to 75% of the resource.mb value.
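A hedged mapred-site.xml fragment following that rule of thumb (values illustrative):
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx1536m</value>
</property>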
Other configurations
There are many other configurations relating to memory limits, some of them deprecated - see the JobConf class. One useful one:
Hadoop 0.x, 1.x (deprecated)        Hadoop 2.x
-------------------------------     ----------------------------------
mapred.job.reduce.total.mem.bytes   mapreduce.reduce.memory.totalbytes
Set this to a low value (10) to force shuffle to happen on disk in the event that you hit an OutOfMemoryError at MapOutputCopier.shuffleInMemory.

Why is 'mapred-site.xml' not included in the latest Hadoop 2.2.0?

The latest build of Hadoop provides mapred-site.xml.template.
Do we need to create a new mapred-site.xml file using this?
Any link on documentation or explanation related to Hadoop 2.2.0 will be much appreciated.
I believe it's still required. For the basic Hadoop 2.2.0 two-node cluster setup that we have working, I did the following from the setup documentation.
"
From the base of the Hadoop installation, edit the etc/hadoop/mapred-site.xml file. A new configuration option for Hadoop 2 is the capability to specify a framework name for MapReduce, setting the mapreduce.framework.name property. In this install we will use the value of "yarn" to tell MapReduce that it will run as a YARN application.
First, copy the template file to the mapred-site.xml.
cp mapred-site.xml.template mapred-site.xml
Next, copy the following into Hadoop etc/hadoop/mapred-site.xml file and remove the original empty tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
"
With regard to documentation, I found this the most useful. Also, the etc/hosts configuration for cluster setup and other cluster-related configs were a bit hard to figure out.

In what mode is Hive installed?

Does the Hive installation have any specific mode?
For example, Hadoop installation has 3 modes: standalone, pseudo-distributed and fully distributed.
Similarly, does Hive have any specific type of distribution?
Can Hive be installed in distributed mode?
Hive actually gives you the option to run queries in two modes:
1- Map-Reduce mode
2- Local mode
Normally the Hive compiler generates map-reduce jobs for most queries under the hood. These jobs are then submitted to the map-reduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers the ability to run map-reduce jobs locally on your standalone workstation. In order to run Hive queries in local mode, you need to do this:
hive> SET mapred.job.tracker=local;
Details can be found here.
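On Hadoop 2 with YARN, where mapred.job.tracker is deprecated, a comparable hedged approach is to point Hive at the local runtime via the mapreduce.framework.name property discussed earlier:
hive> SET mapreduce.framework.name=local;
Hive can also decide this automatically for small jobs if hive.exec.mode.local.auto is set to true.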