Cloudera ToolRunner - hive

I am using Hue to access the Hive service. I created a Hive table using:
create table tablename(colname type,.....)
row format delimited fields terminated by ',';
I uploaded the data (about 300,000 records) without any problem. But when I execute a query like:
select count(*) from tablename;
it launches a MapReduce job, and at that point I get the following warning. How can I resolve it?
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Complete Log:
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1442315442114_0017
INFO : The url to track the job: http://dwiclmaster:8088/proxy/application_1442315442114_0017/
INFO : Starting Job = job_1442315442114_0017, Tracking URL = http://dwiclmaster:8088/proxy/application_1442315442114_0017/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/bin/hadoop job -kill job_1442315442114_0017
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2015-09-15 18:29:06,910 Stage-1 map = 0%, reduce = 0%
INFO : 2015-09-15 18:29:15,257 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.65 sec
INFO : 2015-09-15 18:29:21,513 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.19 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 190 msec
INFO : Ended Job = job_1442315442114_0017

This is just a warning from MapReduce: the jobs Hive submits do not implement the Tool interface, so generic command-line option parsing is skipped. It can be safely ignored.
More about ToolRunner.
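For reference, implementing the Tool interface in your own driver code looks roughly like this (a minimal Scala sketch; the class name is hypothetical, and this is not something you can do for the jobs Hive generates internally, which is why the warning is harmless here):

import org.apache.hadoop.conf.Configured
import org.apache.hadoop.util.{Tool, ToolRunner}

// Hypothetical driver: ToolRunner parses the generic Hadoop options (-D, -files, -libjars)
// into the Configuration before run() is invoked, which is what the warning refers to.
class MyJobDriver extends Configured with Tool {
  override def run(args: Array[String]): Int = {
    // build and submit the MapReduce job here, reading the parsed options via getConf
    0 // return 0 on success, non-zero on failure
  }
}

object MyJobDriver {
  def main(args: Array[String]): Unit =
    System.exit(ToolRunner.run(new MyJobDriver, args))
}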

Related

Spark cannot query Hive tables it can see?

I'm running the prebuilt version of Spark 1.2 for CDH 4 on CentOS. I have copied the hive-site.xml file into the conf directory in Spark so it should see the Hive metastore.
I have three tables in Hive (facility, newpercentile, percentile), all of which I can query from the Hive CLI. After I start Spark and create the Hive context like so: val hiveC = new org.apache.spark.sql.hive.HiveContext(sc), I run into an issue querying these tables.
If I run the following command: val tableList = hiveC.hql("show tables") and do a collect() on tableList, I get this result: res0: Array[org.apache.spark.sql.Row] = Array([facility], [newpercentile], [percentile])
If I then run this command to get the count of the facility table: val facTable = hiveC.hql("select count(*) from facility"), I get the following output, which I take to mean that it cannot find the facility table to query it:
scala> val facTable = hiveC.hql("select count(*) from facility")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/12/26 10:27:26 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/12/26 10:27:26 INFO ParseDriver: Parsing command: select count(*) from facility
14/12/26 10:27:26 INFO ParseDriver: Parse Completed
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(355177) called with curMem=0, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 346.9 KB, free 264.6 MB)
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(50689) called with curMem=355177, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 49.5 KB, free 264.6 MB)
14/12/26 10:27:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:45305 (size: 49.5 KB, free: 264.9 MB)
14/12/26 10:27:26 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/12/26 10:27:26 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:68
facTable: org.apache.spark.sql.SchemaRDD =
SchemaRDD[2] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Aggregate false, [], [Coalesce(SUM(PartialCount#38L),0) AS _c0#5L]
Exchange SinglePartition
Aggregate true, [], [COUNT(1) AS PartialCount#38L]
HiveTableScan [], (MetastoreRelation default, facility, None), None
Any assistance would be appreciated. Thanks.
scala> val facTable = hiveC.hql("select count(*) from facility")
Great! You have an RDD; now what do you want to do with it?
scala> facTable.collect()
Remember that an RDD is an abstraction on top of your data and is not materialized until you invoke an action on it such as collect() or count().
You would get a very obvious error if you tried to use a non-existent table name.
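A minimal sketch (reusing the session names from the question; Spark 1.2-era API):

val facTable = hiveC.hql("select count(*) from facility") // lazy: only builds the query plan
val rows = facTable.collect()                             // action: this is what actually runs the Hive job
println(rows.head)                                        // e.g. [266559841] -- the single row holding the count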

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading a ~170 MB file from S3 immediately beforehand, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

Exception while extracting substring from hive column

I have a column "category" which contains data like this:
"Failed extract of third-party root list from auto update cab at: <http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: The data is invalid."
I need to select the URL part between the "< >" signs in the category column.
I have written a Hive query:
select level,category,regexp_extract(category,'http://[^\>]*') AS url from event where level='Error';
I got an exception:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201406122248_0014, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201406122248_0014
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201406122248_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-13 02:13:35,696 Stage-1 map = 0%, reduce = 0%
2014-06-13 02:14:13,895 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201406122248_0014 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201406122248_0014
Examining task ID: task_201406122248_0014_m_000002 (and more) from job job_201406122248_0014
Task with the most failures(4):
-----
Task ID:
task_201406122248_0014_m_000000
URL:
http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201406122248_0014&tipid=task_201406122248_0014_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"level":"Error","datetimes":"6/13/2014 9:24:05 AM","source":"Microsoft-Windows-CAPI2","eventid":4107,"task":"None","category":"\"Failed extract of third-party root list from auto update cab at: <http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: The data is invalid."}
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:159)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
How can I fix this? Please help.
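One hedged suggestion (not from the original thread, just a sketch): regexp_extract takes an optional group index that defaults to 1, and the pattern above has no capturing group, so the UDF can fail at runtime when it asks for group 1; the \> escape is also unnecessary. Asking for group 0 (the whole match) may avoid the failure:

select level, category,
       regexp_extract(category, 'http://[^>]*', 0) AS url
from event
where level = 'Error';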

Hive always runs MapReduce jobs in local mode

We are testing a multi-node Hadoop cluster (2.4.0) with Hive (0.13.0). The cluster works fine, but when we run a query in Hive, the MapReduce job is always executed locally.
For example:
Without hive-site.xml (in fact, without any configuration file other than the defaults), we set mapred.job.tracker:
hive> SET mapred.job.tracker=192.168.7.183:8032;
And run a query:
hive> select count(1) from suricata;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
OpenJDK 64-Bit Server VM warning: You have loaded library /hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/04/29 12:48:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/29 12:48:02 WARN conf.Configuration: file:/tmp/hadoopuser/hive_2014-04-29_12-47-57_290_2455239450939088471-1/-local-10003/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/29 12:48:02 WARN conf.Configuration: file:/tmp/hadoopuser/hive_2014-04-29_12-47-57_290_2455239450939088471-1/-local-10003/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/hadoopuser/hadoopuser_20140429124747_badfcce6-620e-4718-8c3b-e4ef76bdba7e.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-04-29 12:48:05,450 null map = 0%, reduce = 0%
.......
.......
2014-04-29 12:52:26,982 null map = 100%, reduce = 100%
Ended Job = job_local1983771849_0001
Execution completed successfully
**MapredLocal task succeeded**
OK
266559841
Time taken: 270.176 seconds, Fetched: 1 row(s)
What are we missing?
Set hive.exec.mode.local.auto to false, which disables local mode execution in Hive.
For each query, the compiler generates a DAG of MapReduce jobs. If a job runs in local mode, check these properties:
mapreduce.framework.name=local;
hive.exec.mode.local.auto=false;
If the auto option is enabled, Hive runs the job in local mode when:
Total input size < hive.exec.mode.local.auto.inputbytes.max
Total number of map tasks < hive.exec.mode.local.auto.tasks.max
Total number of reduce tasks is 0 or 1
These options are available from Hive 0.7 onward.
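For example, to push the job onto the cluster you can set these from the Hive CLI (an illustrative sketch; these settings normally belong in mapred-site.xml / yarn-site.xml on the client, and the ResourceManager address below is the one used in the question):

hive> SET hive.exec.mode.local.auto=false;
hive> SET mapreduce.framework.name=yarn;
hive> SET yarn.resourcemanager.address=192.168.7.183:8032;
hive> select count(1) from suricata;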

Errors while running Hive queries

I am trying to run Hive queries but I am getting errors like the following:
hive> FROM (
> FROM t1
> MAP t1.patient_mrn, t1.encounter_date
> USING 'retrieve'
> AS mp1, mp2
> CLUSTER BY mp1) map_output
> INSERT OVERWRITE TABLE t3
> REDUCE map_output.mp1, map_output.mp2
> USING 'q1.txt'
> AS reducef1, reducef2;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201112281627_0097, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201112281627_0097
Kill Command = /home/hadoop/hadoop-0.20.2-cdh3u2//bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201112281627_0097
2011-12-31 03:10:46,391 Stage-1 map = 0%, reduce = 0%
2011-12-31 03:11:29,794 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112281627_0097 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
hive>
The best advice, without knowing a lot more, is where to find the error logs: go to your JobTracker's web page, find the page for that job, and drill down to the error logs.
Look for any "failed" tasks, click there to get to the page for that specific task.
You'll eventually get to the page containing the task-specific log, and that should help you diagnose the problem.
This could happen in any number of scenarios. Rerun the query, check the JobTracker for the failed/killed attempts, and go through the logs for the exact reason.