Hive always runs mapred jobs in local mode

We are testing a multi-node Hadoop cluster (2.4.0) with Hive (0.13.0). The cluster works fine, but when we run a query in Hive, the mapred jobs are always executed locally.
For example:
Without hive-site.xml (in fact, without any configuration file other than the defaults), we set mapred.job.tracker:
hive> SET mapred.job.tracker=192.168.7.183:8032;
And run a query:
hive> select count(1) from suricata;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
OpenJDK 64-Bit Server VM warning: You have loaded library /hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/04/29 12:48:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/29 12:48:02 WARN conf.Configuration: file:/tmp/hadoopuser/hive_2014-04-29_12-47-57_290_2455239450939088471-1/-local-10003/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/29 12:48:02 WARN conf.Configuration: file:/tmp/hadoopuser/hive_2014-04-29_12-47-57_290_2455239450939088471-1/-local-10003/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/hadoopuser/hadoopuser_20140429124747_badfcce6-620e-4718-8c3b-e4ef76bdba7e.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-04-29 12:48:05,450 null map = 0%, reduce = 0%
.......
.......
2014-04-29 12:52:26,982 null map = 100%, reduce = 100%
Ended Job = job_local1983771849_0001
Execution completed successfully
**MapredLocal task succeeded**
OK
266559841
Time taken: 270.176 seconds, Fetched: 1 row(s)
What are we missing?

Set hive.exec.mode.local.auto to false, which will disable local mode execution in Hive.

For each query the compiler generates a DAG of map-reduce jobs. If the job runs in local mode, check the properties below:
mapreduce.framework.name=local;
hive.exec.mode.local.auto=false;
If the auto option is enabled, Hive runs the job in local mode when:
Total input size < hive.exec.mode.local.auto.inputbytes.max
Total number of map tasks < hive.exec.mode.local.auto.tasks.max
Total number of reduce tasks is 1 or 0
These options are available from Hive 0.7.
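As a sketch of what this usually looks like in practice on Hadoop 2.x (the resource manager address below is simply reused from the question), you can force jobs onto the cluster by pointing Hive at the YARN framework and disabling automatic local mode, either per session as below or in the client-side mapred-site.xml / yarn-site.xml (and hive-site.xml for the Hive property):
hive> SET mapreduce.framework.name=yarn;
hive> SET yarn.resourcemanager.address=192.168.7.183:8032;
hive> SET hive.exec.mode.local.auto=false;
hive> select count(1) from suricata;
Note that mapred.job.tracker is the old MR1 property; on a YARN cluster the two framework settings above are the ones the client actually reads.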

Related

Why is the Flink container vcore size always 1

I am running Flink on YARN (more precisely, in an AWS EMR YARN cluster).
I read in the Flink documentation and source code that, by default, each TaskManager container will request as many vcores from YARN as it has slots per TaskManager.
I also confirmed this from the source code:
// Resource requirements for worker containers
int taskManagerSlots = taskManagerParameters.numSlots();
int vcores = config.getInteger(ConfigConstants.YARN_VCORES,
        Math.max(taskManagerSlots, 1));
Resource capability = Resource.newInstance(containerMemorySizeMB, vcores);
resourceManagerClient.addContainerRequest(
        new AMRMClient.ContainerRequest(capability, null, null, priority));
When I use -yn 1 -ys 3 to start Flink, I assume YARN will allocate 3 vcores for the only TaskManager container, but when I check the number of vcores for each container in the YARN resource manager web UI, I always see that the number of vcores is 1. I also see the vcore count as 1 in the YARN resource manager logs.
I debugged the Flink source code down to the lines pasted above, and I saw that the value of vcores is 3.
This really confuses me. Can anyone help clarify? Thanks.
An answer from Kien Truong
Hi,
You have to enable CPU scheduling in YARN; otherwise it always shows that only 1 CPU is allocated for each container,
regardless of how many Flink tries to allocate. So you should add (edit) the following property in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Also, the TaskManager memory is, for example, 1400MB, but Flink reserves some amount for off-heap memory, so the actual heap size is smaller.
This is controlled by 2 settings:
containerized.heap-cutoff-min: default 600MB
containerized.heap-cutoff-ratio: default 15% of TM's memory
That's why your TM's heap size is limited to ~800MB (1400 - 600).
#yinhua.
Use the command ./bin/yarn-session.sh to start a session; you need to add the -s argument.
-s,--slots Number of slots per TaskManager
details:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/yarn_setup.html
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/cli.html#usage
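For example, a session with a single TaskManager and three slots could be started roughly like this (flag names as documented for Flink 1.4 in the links above; the memory values are just placeholders):
./bin/yarn-session.sh -n 1 -s 3 -jm 1024 -tm 1400
Even then, the YARN web UI will keep reporting 1 vcore per container unless the DominantResourceCalculator from the answer above is configured.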
I got the answer finally.
It's because YARN uses the "DefaultResourceCalculator" allocation strategy, so only memory is counted by the YARN RM; even though Flink requested 3 vcores, YARN simply ignores the CPU core number.

Cloudera ToolRunner

I am using Hue to access the Hive service. I created a Hive table using
create table tablename(colname type,.....)
row format delimited fields terminated by ',';
I uploaded the data, 300,000 records, without problems. But while executing a query like:
select count(*) from tablename;
it creates a MapReduce job, and at this point I get the following warning. How do I resolve this warning?
WARN : Hadoop command-line option parsing not performed. Implement
the Tool interface and execute your application with ToolRunner to
remedy this.
Complete Log:
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1442315442114_0017
INFO : The url to track the job: http://dwiclmaster:8088/proxy/application_1442315442114_0017/
INFO : Starting Job = job_1442315442114_0017, Tracking URL = http://dwiclmaster:8088/proxy/application_1442315442114_0017/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/bin/hadoop job -kill job_1442315442114_0017
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2015-09-15 18:29:06,910 Stage-1 map = 0%, reduce = 0%
INFO : 2015-09-15 18:29:15,257 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.65 sec
INFO : 2015-09-15 18:29:21,513 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.19 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 190 msec
INFO : Ended Job = job_1442315442114_0017
This is just a warning coming from MapReduce, since jobs submitted by Hive do not implement the Tool interface. It can be safely ignored.
More about ToolRunner.
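For context, this is roughly what the warning is asking for when you submit your own MapReduce code; it is not something you can change for Hive-generated jobs. A minimal sketch of the usual Tool/ToolRunner pattern (the class and job names here are made up for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, -libjars, ...) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}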

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to a read from S3 of a ~170MB file immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

Exception while extracting substring from hive column

I have one column, "category", which contains data like this:
"Failed extract of third-party root list from auto update cab at: <http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: The data is invalid."
I need to select the URL part between the "<" and ">" signs of the category column.
I have written a Hive query:
select level,category,regexp_extract(category,'http://[^\>]*') AS url from event where level='Error';
I got an exception :
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201406122248_0014, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201406122248_0014
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201406122248_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-13 02:13:35,696 Stage-1 map = 0%, reduce = 0%
2014-06-13 02:14:13,895 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201406122248_0014 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201406122248_0014
Examining task ID: task_201406122248_0014_m_000002 (and more) from job job_201406122248_0014
Task with the most failures(4):
-----
Task ID:
task_201406122248_0014_m_000000
URL:
http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201406122248_0014&tipid=task_201406122248_0014_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"level":"Error","datetimes":"6/13/2014 9:24:05 AM","source":"Microsoft-Windows-CAPI2","eventid":4107,"task":"None","category":"\"Failed extract of third-party root list from auto update cab at: <http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: The data is invalid."}
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:159)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
How do I fix this?
Please help.
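As an illustrative sketch (the exact escaping required may depend on how the data was loaded): regexp_extract takes a third argument, the index of the capturing group to return, so one way to isolate only the URL between the angle brackets would be along these lines:
select level, category,
       regexp_extract(category, '<(http://[^>]*)>', 1) AS url
from event
where level = 'Error';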

Errors while running Hive queries

I am trying to run Hive queries but I am getting errors:
hive> FROM (
> FROM t1
> MAP t1.patient_mrn, t1.encounter_date
> USING 'retrieve'
> AS mp1, mp2
> CLUSTER BY mp1) map_output
> INSERT OVERWRITE TABLE t3
> REDUCE map_output.mp1, map_output.mp2
> USING 'q1.txt'
> AS reducef1, reducef2;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201112281627_0097, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201112281627_0097
Kill Command = /home/hadoop/hadoop-0.20.2-cdh3u2//bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201112281627_0097
2011-12-31 03:10:46,391 Stage-1 map = 0%, reduce = 0%
2011-12-31 03:11:29,794 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112281627_0097 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
hive>
The best advice, without knowing a lot more, is on where to find the error logs: go to your JobTracker's web page, find the page for that job, and drill down to find the error logs.
Look for any "failed" tasks, click there to get to the page for that specific task.
You'll eventually get to the page containing the task-specific log, and that should help you diagnose the problem.
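If the web UI is not handy, a rough equivalent from the command line is the old job client (job id taken from the output above; exact options depend on the Hadoop/CDH version):
hadoop job -status job_201112281627_0097
hadoop job -events job_201112281627_0097 0 100
The -events listing includes the failed task attempts, whose task-level logs you can then open on the corresponding TaskTracker.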
This could happen in any number of scenarios. Rerun the query once more, check the JobTracker for the failed/killed attempts, and go through the logs for the exact reason.