FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

While trying to overwrite a Hive managed table into an external table, I am getting the error below.
Query ID = hdfs_20150701054444_576849f9-6b25-4627-b79d-f5defc13c519
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435327542589_0049, Tracking URL = http://ip-XXX.XX.XX.XX ec2.internal:8088/proxy/application_1435327542589_0049/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435327542589_0049
Hadoop job information for Stage-0: number of mappers: 0; number of reducers: 0
2015-07-01 05:44:11,676 Stage-0 map = 0%, reduce = 0%
Ended Job = job_1435327542589_0049 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Related

TEZ mapper resource request

We recently migrated from MapReduce to Tez for executing Hive queries on EMR. We are seeing cases where the exact same Hive query launches a very different number of mappers. See the Map 3 phase below: on the first run it requested 305 mappers, and on another run it requested 4534. (Please ignore the KILLED status because I manually killed the query.) Why does this happen? How can we change it to be based on the underlying data size instead?
Run 1
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 container KILLED 5 0 0 5 0 0
Map 3 container KILLED 305 0 0 305 0 0
Map 5 container KILLED 16 0 0 16 0 0
Map 6 container KILLED 1 0 0 1 0 0
Reducer 2 container KILLED 333 0 0 333 0 0
Reducer 4 container KILLED 796 0 0 796 0 0
----------------------------------------------------------------------------------------------
VERTICES: 00/06 [>>--------------------------] 0% ELAPSED TIME: 14.16 s
----------------------------------------------------------------------------------------------
Run 2
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 5 5 0 0 0 0
Map 3 container KILLED 4534 0 0 4534 0 0
Map 5 .......... container SUCCEEDED 325 325 0 0 0 0
Map 6 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 container KILLED 333 0 0 333 0 0
Reducer 4 container KILLED 796 0 0 796 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/06 [=>>-------------------------] 5% ELAPSED TIME: 527.16 s
----------------------------------------------------------------------------------------------
This article explains the process by which Tez allocates resources: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works. The relevant steps are quoted below, followed by a config sketch.
If Tez grouping is enabled for the splits, then a generic grouping logic is run on these splits to group them into larger splits. The idea is to strike a balance between how parallel the processing is and how much work is being done in each parallel process.
First, Tez tries to find out the resource availability in the cluster for these tasks. For that, YARN provides a headroom value (and in the future other attributes may be used). Let's say this value is T.
Next, Tez divides T by the resource per task (say M) to find out how many tasks can run in parallel at once (i.e. in a single wave): W = T/M.
Next, W is multiplied by a wave factor (from configuration: tez.grouping.split-waves) to determine the number of tasks to be used. Let's say this value is N.
If there are a total of X splits (input shards) and N tasks, then this would group X/N splits per task. Tez then estimates the size of data per task based on the number of splits per task.
If this value is between tez.grouping.max-size and tez.grouping.min-size, then N is accepted as the number of tasks. If not, then N is adjusted to bring the data per task in line with the max/min, depending on which threshold was crossed.
For experimental purposes, tez.grouping.split-count can be set in the configuration to specify the desired number of groups. If this config is specified, then the above logic is ignored and Tez tries to group splits into the specified number of groups. This is best effort.
After this, the grouping algorithm is executed. It groups splits by node locality, then rack locality, while respecting the group size limits.
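Per the quoted logic, the initial mapper count is driven by cluster headroom (T), which is why the same query can request very different numbers of mappers on a busy versus an idle cluster. To tie the count to data size instead, the grouping size thresholds can be pinned per session. A minimal sketch, assuming Hive on Tez; the property names come from the quote above, while the byte values are illustrative placeholders, not recommendations:
-- hedged sketch: bound the data per grouped split so the task count follows input size
set tez.grouping.min-size=134217728;
set tez.grouping.max-size=268435456;
-- optionally lower the wave multiplier so fewer tasks are requested per wave
set tez.grouping.split-waves=1.7;
With min-size and max-size pinned close together (here roughly 128 MB and 256 MB), the data-per-task check in the quoted steps dominates, so the mapper count tracks the input size rather than the momentary headroom.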

Make Hive return the value only

I want Hive to return only the value, not the other lines of information about the processing!
hive> select max(temp) from temp where dtime like '2014-07%' ;
Query ID = hduser_20170608003255_d35b8a43-8cc5-4662-89ce-9ee5f87d3ba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1496864651740_0008, Tracking URL = http://localhost:8088/proxy/application_1496864651740_0008/
Kill Command = /home/hduser/hadoop/bin/hadoop job -kill job_1496864651740_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-06-08 00:33:01,955 Stage-1 map = 0%, reduce = 0%
2017-06-08 00:33:08,187 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.13 sec
2017-06-08 00:33:14,414 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.91 sec
MapReduce Total cumulative CPU time: 5 seconds 910 msec
Ended Job = job_1496864651740_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.91 sec HDFS Read: 853158 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 910 msec
OK
44.4
Time taken: 20.01 seconds, Fetched: 1 row(s)
I want it to return only the value, which is 44.4.
Thanks in advance...
You can put the result into a variable in a shell script. The max_temp variable will contain only the result:
max_temp=$(hive -e " set hive.cli.print.header=false; select max(temp) from temp where dtime like '2014-07%';")
echo "$max_temp"
You can also use -S (silent mode):
hive -S -e "select max(temp) from temp where dtime like '2014-07%';"
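If the bare value should land in a file rather than a shell variable, silent mode combines with ordinary shell redirection. A minimal sketch, where result.txt is just a placeholder file name:
hive -S -e "set hive.cli.print.header=false; select max(temp) from temp where dtime like '2014-07%';" > result.txt
cat result.txt
With the data above, result.txt should contain only 44.4.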

Is there a way to get database name in a hive UDF

I am writing a Hive UDF.
I have to get the name of the database the function is deployed in. Then I need to access a few files from HDFS depending on the database environment. Can you please tell me which function can help with running an HQL query from a Hive UDF?
Write the UDF class and prepare a jar file:
package com.hiveudf.strmnp;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyHiveUdf extends UDF {
    // Prefixes the given text with the database name passed in from the query.
    public Text evaluate(String text, String dbName) {
        if (text == null) {
            return null;
        } else {
            return new Text(dbName + "." + text);
        }
    }
}
Use this UDF inside a Hive query as shown below:
hive> use mydb;
OK
Time taken: 0.454 seconds
hive> ADD jar /root/MyUdf.jar;
Added [/root/MyUdf.jar] to class path
Added resources: [/root/MyUdf.jar]
hive> create temporary function myUdfFunction as 'com.hiveudf.strmnp.MyHiveUdf';
OK
Time taken: 0.018 seconds
hive> select myUdfFunction(username,current_database()) from users;
Query ID = root_20170407151010_2ae29523-cd9f-4585-b334-e0b61db2c57b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0004, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-07 15:11:11,376 Stage-1 map = 0%, reduce = 0%
2017-04-07 15:11:19,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.12 sec
MapReduce Total cumulative CPU time: 3 seconds 120 msec
Ended Job = job_1491484583384_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.12 sec HDFS Read: 21659 HDFS Write: 381120 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 120 msec
OK
mydb.user1
mydb.user2
mydb.user3
Time taken: 2.137 seconds, Fetched: 3 row(s)
hive>

Apache Pig: OutOfMemoryError with FOREACH aggregation

I'm getting an OutOfMemoryError when running a FOREACH operation in Apache Pig.
16/06/24 15:14:17 INFO util.SpillableMemoryManager: first memory
handler call- Usage threshold init = 164102144(160256K) used =
556137816(543103K) committed = 698875904(682496K) max =
698875904(682496K)
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 4095"... Killed
My Pig script:
A = LOAD 'PageCountTest/' USING PigStorage(' ') AS (Project:chararray, Title:chararray, count:int, size:int);
B = GROUP A BY (Project, Title);
C = FOREACH B GENERATE group, SUM(A.count) AS COUNT;
D = ORDER C BY COUNT DESC;
STORE C INTO '/user/hadoop/wikistats';
Sample Data:
aa.b Main_Page 1 14335
aa.d India 1 4075
aa.d Main_Page 1 13190
aa.d Special:RecentChanges 1 200
aa.d Talk:Main_Page/ 1 14147
aa.d w/w/index.php 9 137502
aa Main_Page 6 9872
aa Special:Statistics 1 324
Can anyone please help?
I suspect the memory issue arises in the ORDER BY, as it is the most heavyweight operation. The easiest way is to play with the heap size parameters when launching your Pig job.
pig -Dmapred.child.java.opts=-Xmx2096M yourjob.pig
You can also set the heap size through an environment variable before launching Pig:
export PIG_HEAPSIZE=2096
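For completeness, a minimal sketch of both approaches, assuming a roughly 2 GB heap target (the values are placeholders to tune for your data and cluster). The -Xmx flag in mapred.child.java.opts raises the heap of the map/reduce task JVMs, while PIG_HEAPSIZE (in MB) sizes the JVM that runs Pig itself:
export PIG_HEAPSIZE=2048                             # heap for the local Pig JVM, in MB
pig -Dmapred.child.java.opts=-Xmx2048m yourjob.pig   # heap for the MapReduce task JVMs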

Write the output of a script to a file in hive

I have a script with a set of 5 queries. I would like to execute the script and write the output to a file. What command should I give from the Hive CLI?
Thanks
Sample queries file (3 queries):
ramisetty#aspire:~/my_tmp$ cat queries.q
show databases; --query1
use my_db; --query2
INSERT OVERWRITE LOCAL DIRECTORY './outputLocalDir' --query3
select * from students where branch = "ECE"; --query3
Run HIVE:
ramisetty#aspire:~/my_tmp$ hive
hive (default)> source ./queries.q;
--output of Q1 on console-----
Time taken: 7.689 seconds
--output of Q2 on console -----
Time taken: 1.689 seconds
____________________________________________________________
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201401251835_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201401251835_0004
Kill Command = /home/ramisetty/VJDATA/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201401251835_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-01-25 19:06:56,689 Stage-1 map = 0%, reduce = 0%
2014-01-25 19:07:05,868 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:14,047 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:15,059 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.07 sec
MapReduce Total cumulative CPU time: 2 seconds 70 msec
Ended Job = job_201401251835_0004
Copying data to local directory outputLocalDir
Copying data to local directory outputLocalDir
2 Rows loaded to outputLocalDir
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.07 sec HDFS Read: 525 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 70 msec
OK
firstname secondname dob score branch
Time taken: 32.44 seconds
Output file:
cat ./outputLocalDir/000000_0
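If the goal is simply to capture everything the script prints into a single file, rather than loading results into a local directory, plain shell redirection of the CLI works as well. A minimal sketch, assuming the script above is saved as queries.q and output.txt is just a placeholder name:
hive -S -f queries.q > output.txt    # -S suppresses the progress logs; query results go to output.txt
cat output.txt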