I am using the HDP 2.6 Sandbox. I created a user space for the user root under the hdfs group. When I run the Sqoop Hive import below, I encounter the following two errors:
Failed with exception org.apache.hadoop.security.AccessControlException: User null does not belong to Hadoop at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setOwner(FSDirAttrOp.java:89)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
However, the data was imported correctly into the Hive table.
Please help me understand the significance of these errors and how I can resolve them.
[root@sandbox-hdp ~]# sqoop import \
> --connect jdbc:mysql://sandbox.hortonworks.com:3306/retail_db \
> --username retail_dba \
> --password hadoop \
> --table departments \
> --hive-home /apps/hive/warehouse \
> --hive-import \
> --create-hive-table \
> --hive-table retail_db.departments \
> --target-dir /user/root/hive_import \
> --outdir java_files
Warning: /usr/hdp/2.6.3.0-235/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/01/14 09:42:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.6.3.0-235
18/01/14 09:42:38 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/01/14 09:42:38 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
18/01/14 09:42:38 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
18/01/14 09:42:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/01/14 09:42:38 INFO tool.CodeGenTool: Beginning code generation
18/01/14 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1
18/01/14 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1
18/01/14 09:42:39 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.6.3.0-235/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/e1ec5b443f92219f1f061ad4b64cc824/departments.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/01/14 09:42:40 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/e1ec5b443f92219f1f061ad4b64cc824/departments.jar
18/01/14 09:42:40 WARN manager.MySQLManager: It looks like you are importing from mysql.
18/01/14 09:42:40 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
18/01/14 09:42:40 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
18/01/14 09:42:40 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
18/01/14 09:42:40 INFO mapreduce.ImportJobBase: Beginning import of departments
18/01/14 09:42:41 INFO client.RMProxy: Connecting to ResourceManager at sandbox-hdp.hortonworks.com/172.17.0.2:8032
18/01/14 09:42:42 INFO client.AHSProxy: Connecting to Application History server at sandbox-hdp.hortonworks.com/172.17.0.2:10200
18/01/14 09:42:46 INFO db.DBInputFormat: Using read commited transaction isolation
18/01/14 09:42:46 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`department_id`), MAX(`department_id`) FROM `departments`
18/01/14 09:42:46 INFO db.IntegerSplitter: Split size: 1; Num splits: 4 from: 2 to: 7
18/01/14 09:42:46 INFO mapreduce.JobSubmitter: number of splits:4
18/01/14 09:42:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515818851132_0050
18/01/14 09:42:47 INFO impl.YarnClientImpl: Submitted application application_1515818851132_0050
18/01/14 09:42:47 INFO mapreduce.Job: The url to track the job: http://sandbox-hdp.hortonworks.com:8088/proxy/application_1515818851132_0050/
18/01/14 09:42:47 INFO mapreduce.Job: Running job: job_1515818851132_0050
18/01/14 09:42:55 INFO mapreduce.Job: Job job_1515818851132_0050 running in uber mode : false
18/01/14 09:42:55 INFO mapreduce.Job: map 0% reduce 0%
18/01/14 09:43:05 INFO mapreduce.Job: map 25% reduce 0%
18/01/14 09:43:09 INFO mapreduce.Job: map 50% reduce 0%
18/01/14 09:43:12 INFO mapreduce.Job: map 75% reduce 0%
18/01/14 09:43:14 INFO mapreduce.Job: map 100% reduce 0%
18/01/14 09:43:14 INFO mapreduce.Job: Job job_1515818851132_0050 completed successfully
18/01/14 09:43:16 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=682132
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=481
HDFS: Number of bytes written=60
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=44760
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=44760
Total vcore-milliseconds taken by all map tasks=44760
Total megabyte-milliseconds taken by all map tasks=11190000
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=481
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1284
CPU time spent (ms)=5360
Physical memory (bytes) snapshot=561950720
Virtual memory (bytes) snapshot=8531210240
Total committed heap usage (bytes)=176685056
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=60
18/01/14 09:43:16 INFO mapreduce.ImportJobBase: Transferred 60 bytes in 34.7351 seconds (1.7274 bytes/sec)
18/01/14 09:43:16 INFO mapreduce.ImportJobBase: Retrieved 6 records.
18/01/14 09:43:16 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners
18/01/14 09:43:16 WARN mapreduce.PublishJobData: Unable to publish import data to publisher org.apache.atlas.sqoop.hook.SqoopHook
java.lang.ClassNotFoundException: org.apache.atlas.sqoop.hook.SqoopHook
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.sqoop.mapreduce.PublishJobData.publishJobData(PublishJobData.java:46)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:284)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:507)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
18/01/14 09:43:16 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `departments` AS t LIMIT 1
18/01/14 09:43:16 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/hdp/2.6.3.0-235/hive/lib/hive-common-1.2.1000.2.6.3.0-235.jar!/hive-log4j.properties
OK
Time taken: 10.427 seconds
Loading data to table retail_db.departments
Failed with exception org.apache.hadoop.security.AccessControlException: User null does not belong to Hadoop at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setOwner(FSDirAttrOp.java:89) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setOwner(FSNamesystem.java:1873) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setOwner(NameNodeRpcServer.java:828)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setOwner(ClientNamenodeProtocolServerSideTranslatorPB.java:476)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
The first error
WARN mapreduce.PublishJobData: Unable to publish import data to publisher org.apache.atlas.sqoop.hook.SqoopHook java.lang.ClassNotFoundException: org.apache.atlas.sqoop.hook.SqoopHook
Check whether your Sqoop binaries are intact. It is easier to copy them again than to check them file by file.
The second error
Failed with exception org.apache.hadoop.security.AccessControlException: User null does not belong to Hadoop
This happens because you are running Sqoop as the "root" user. Run it as a user that exists in the Hadoop cluster.
Two ideas
ClassNotFoundException: org.apache.atlas.sqoop.hook.SqoopHook
A class is missing from the classpath.
I also see that you are running your Sqoop command from the root account on Linux. Make sure root belongs to the hdfs group; I am not sure it is included by default.
Sometimes Sqoop does not handle null values when importing data from an RDBMS into Hive, so you should handle them explicitly with the following options:
--null-string and --null-non-string
The complete command is:
sqoop import --connect jdbc:mysql://sandbox.hortonworks.com:3306/retail_db --username retail_dba --password hadoop --table departments --hive-home /apps/hive/warehouse --null-string 'na' --null-non-string 'na' --hive-import --create-hive-table --hive-table retail_db.departments --target-dir /user/root/hive_import
This occurs because of the following property in /etc/hive/conf/hive-site.xml:
<name>hive.warehouse.subdir.inherit.perms</name>
<value>true</value>
Set the value to false and run the same command again.
Alternatively, give the --target-dir directory /user/root/hive_import read/write access, or remove the option, in which case the Hive home directory is used.
I know this is one of the most frequently asked questions. I have looked almost everywhere, and none of the resources resolved the issue I am facing.
Below is a simplified version of my problem. The actual data is more complex, so I have to use a UDF.
My input File: (input.txt)
NotNeeded1,NotNeeded11;Needed1
NotNeeded2,NotNeeded22;Needed2
I want the output to be
Needed1
Needed2
So I wrote the following UDF (Java code):
package com.company.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class myudf extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        String s = (String) input.get(0);
        String str = s.split("\\,")[1];
        String str1 = str.split("\\;")[1];
        return str1;
    }
}
I packaged it into
rollupreg_extract-jar-with-dependencies.jar
Below is my Pig (Grunt) shell session:
grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
grunt> DEFINE myudf com.company.pig.myudf;
grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(',');
grunt> extract = FOREACH data GENERATE myudf($1);
grunt> DUMP extract;
And I get the below error:
2017-05-15 15:58:15,493 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-05-15 15:58:15,577 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-05-15 15:58:15,659 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-05-15 15:58:15,774 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2017-05-15 15:58:15,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-05-15 15:58:16,184 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:16,196 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:16,396 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:16,576 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-05-15 15:58:16,580 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:16,584 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-05-15 15:58:16,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2017-05-15 15:58:17,258 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar
2017-05-15 15:58:17,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2017-05-15 15:58:17,294 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2017-05-15 15:58:17,354 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2017-05-15 15:58:17,510 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:17,753 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-05-15 15:58:17,820 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-05-15 15:58:17,884 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2017-05-15 15:58:17,889 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
2017-05-15 15:58:17,922 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2017-05-15 15:58:18,525 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2017-05-15 15:58:18,692 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023
2017-05-15 15:58:18,879 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2017-05-15 15:58:18,973 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023
2017-05-15 15:58:19,029 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C: R:
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023]
2017-05-15 15:58:29,156 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-05-15 15:58:29,156 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs
2017-05-15 15:58:29,157 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-05-15 15:58:29,790 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:29,791 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:29,793 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,311 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:30,312 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:30,313 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-05-15 15:58:30,467 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.3.2.5.0.0-1245 root 2017-05-15 15:58:16 2017-05-15 15:58:30 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1494853652295_0023 data,extract MAP_ONLY Message: Job failed! hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225,
Input(s):
Failed to read data from "/pig_hdfs/input.txt"
Output(s):
Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1494853652295_0023
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract
Details at logfile: /pig/pig_1494863836458.log
I know it complains that
Failed to read data from "/pig_hdfs/input.txt"
But I am sure this is not the actual issue. If I skip the UDF and dump the data directly, I get the output, so the input file itself is readable.
First, you do not need a UDF to get the desired output. You can use a semicolon as the delimiter in the LOAD statement and take the second column.
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';');
extract = FOREACH data GENERATE $1;
DUMP extract;
If you insist on using a UDF, you have to load the record into a single field and then apply the UDF to it. Also, your UDF is incorrect: it should split the string s, which is passed in from the Pig script, using ';' as the delimiter.
String s = (String)input.get(0);
String str1 = s.split("\\;")[1];
In your Pig script, load the entire record into one field and apply the UDF to field $0.
REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
DEFINE myudf com.company.pig.myudf;
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
extract = FOREACH data GENERATE myudf($0);
DUMP extract;
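To see why the original split fails and how the corrected one behaves, here is a small standalone Java sketch. It deliberately avoids the Pig API so it can be run on its own. The class name ExtractDemo and the helper afterSemicolon are hypothetical names used only for illustration, and returning null for a record without a usable ';' is an assumption about how malformed lines should be treated.

```java
// Standalone illustration; not a Pig UDF. ExtractDemo and afterSemicolon
// are hypothetical names introduced for this sketch.
class ExtractDemo {

    // Returns the text after the first ';' in the record.
    // Assumption: records with no ';' (or nothing after it) yield null.
    static String afterSemicolon(String record) {
        if (record == null) {
            return null;
        }
        int idx = record.indexOf(';');
        if (idx < 0 || idx == record.length() - 1) {
            return null;
        }
        return record.substring(idx + 1);
    }

    public static void main(String[] args) {
        // With PigStorage(','), field $1 of the first input line is this value.
        String secondField = "NotNeeded11;Needed1";

        // It contains no comma, so split(",") returns a single element and
        // indexing [1], as the original UDF does, would throw
        // ArrayIndexOutOfBoundsException.
        System.out.println(secondField.split(",").length);

        // Splitting the whole record on ';' gives the wanted value.
        System.out.println(afterSemicolon("NotNeeded1,NotNeeded11;Needed1"));

        // A record without ';' returns null instead of throwing.
        System.out.println(afterSemicolon("NotNeeded1,NotNeeded11"));
    }
}
```

Inside the UDF's exec method, the same logic could be applied to (String) input.get(0) once the record is loaded as a single chararray field, as in the script above.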
I am using the Cloudera QuickStart Docker image.
The QuickStart image has MySQL installed. When I run the following Sqoop command from the command line to import the categories table, it works, and I can see that the categories table is created:
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera -m 1 --table categories --hive-import --hive-overwrite
Then I logged into Hue as the cloudera user and created a new Oozie workflow with a single Sqoop action. When I execute it, Sqoop downloads the data into HDFS but fails when it tries to create the Hive table on top of that data.
This is what my workflow.xml looks like:
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2">
<start to="sqoop-4467"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="sqoop-4467">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera -m 1 --table categories --hive-import --hive-overwrite
</command>
</sqoop>
<ok to="End"/>
<error to="Kill"/>
<sla:info>
<sla:nominal-time>${nominal_time}</sla:nominal-time>
<sla:should-end>${30 * MINUTES}</sla:should-end>
</sla:info>
</action>
<end name="End"/>
</workflow-app>
This is what my job.properties file looks like:
oozie.use.system.libpath=True
security_enabled=False
dryrun=False
nameNode=hdfs://quickstart.cloudera:8020
nominal_time=2016-12-20T20:53Z
jobTracker=quickstart.cloudera:8032
After the job failed, I checked the /user/home/cloudera folder. I can see the categories folder with data, but the Hive table was not created. This is the error I see in the JobHistory server for the failed job:
Sqoop command arguments :
import
--connect
jdbc:mysql://localhost/retail_db
--username
root
--password
cloudera
-m
1
--table
categories
--hive-import
--hive-overwrite
Fetching child yarn jobs
tag id : oozie-3ff81b7743470e73dcb44de6729a66d9
Child yarn jobs are found -
=================================================================
>>> Invoking Sqoop command line now >>>
6223 [uber-SubtaskRunner] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
6302 [uber-SubtaskRunner] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.6-cdh5.7.0
6336 [uber-SubtaskRunner] WARN org.apache.sqoop.tool.BaseSqoopTool - Setting your password on the command-line is insecure. Consider using -P instead.
6336 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.BaseSqoopTool - Using Hive-specific delimiters for output. You can override
6336 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.BaseSqoopTool - delimiters with --fields-terminated-by, etc.
6367 [uber-SubtaskRunner] WARN org.apache.sqoop.ConnFactory - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
6654 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.MySQLManager - Preparing to use a MySQL streaming resultset.
6666 [uber-SubtaskRunner] INFO org.apache.sqoop.tool.CodeGenTool - Beginning code generation
7250 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
7279 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
7281 [uber-SubtaskRunner] INFO org.apache.sqoop.orm.CompilationManager - HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
9303 [uber-SubtaskRunner] INFO org.apache.sqoop.orm.CompilationManager - Writing jar file: /tmp/sqoop-yarn/compile/4fd8773510dfe4082d136b2ab7d27eb3/categories.jar
9314 [uber-SubtaskRunner] WARN org.apache.sqoop.manager.MySQLManager - It looks like you are importing from mysql.
9314 [uber-SubtaskRunner] WARN org.apache.sqoop.manager.MySQLManager - This transfer can be faster! Use the --direct
9314 [uber-SubtaskRunner] WARN org.apache.sqoop.manager.MySQLManager - option to exercise a MySQL-specific fast path.
9314 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.MySQLManager - Setting zero DATETIME behavior to convertToNull (mysql)
9318 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.ImportJobBase - Beginning import of categories
9388 [uber-SubtaskRunner] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies.
10238 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.db.DBInputFormat - Using read commited transaction isolation
29055 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.ImportJobBase - Transferred 1.0049 KB in 19.659 seconds (52.3425 bytes/sec)
29061 [uber-SubtaskRunner] INFO org.apache.sqoop.mapreduce.ImportJobBase - Retrieved 58 records.
29076 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
29097 [uber-SubtaskRunner] INFO org.apache.sqoop.hive.HiveImport - Loading uploaded data into Hive
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://quickstart.cloudera:8020/user/cloudera/oozie-oozi/0000012-161221020706124-oozie-oozi-W/sqoop-4467--sqoop/action-data.seq
Oozie Launcher ends
Did you copy hive-site.xml to HDFS? That should fix it. Alternatively, import the table to an HDFS path using --target-dir and set the location of the Hive table to point to that path.
grunt> table_load = load 'test_table_one' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> dump table_load;
2016-10-05 17:25:43,798 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-10-05 17:25:43,930 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://localhost:9084
2016-10-05 17:25:43,931 [main] INFO hive.metastore - Opened a connection to metastore, current connections: 1
2016-10-05 17:25:43,934 [main] INFO hive.metastore - Connected to metastore.
…
2016-10-05 17:25:58,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1475669003352_0017
2016-10-05 17:25:58,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases table_load
2016-10-05 17:25:58,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: table_load[7,13] C: R:
2016-10-05 17:25:58,716 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-10-05 17:25:58,716 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1475669003352_0017]
2016-10-05 17:26:13,753 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-10-05 17:26:13,753 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1475669003352_0017 has failed! Stop running all dependent jobs
2016-10-05 17:26:13,753 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-10-05 17:26:13,882 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2016-10-05 17:26:13,883 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0 0.15.0 hadoop 2016-10-05 17:25:57 2016-10-05 17:26:13 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1475669003352_0017 table_load MAP_ONLY Message: Job failed! hdfs://mycluster/tmp/temp81690062/tmp2002161033,
Input(s):
Failed to read data from "test_table_one"
Output(s):
Failed to produce result in "hdfs://mycluster/tmp/temp81690062/tmp2002161033"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1475669003352_0017
2016-10-05 17:26:13,883 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-10-05 17:26:13,889 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias table_load
Details at logfile: /home/hadoop/pig_1475674706670.log
Can you help me find out why this is happening?
Either start Pig with pig -useHCatalog, or start plain pig and REGISTER the supporting JARs that HCatalog needs to work in Grunt.
You can find the required JARs among those that are shipped to HDFS when you run pig -useHCatalog.
grunt> table_load = load 'test_table_one' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> dump table_load;
The reason may be that you have not created a Hive table with that exact name. Check the Hive table and its schema.
Before using HCatalog, you have to create the table schema on top of the location from which you are loading the data. Specify a queue name if required. Before executing, check that the table exists in Hive.
Hope this helps; give it a try.
I am trying to run a simple Pig script that I have scheduled via Oozie; however, I get the following Oozie error after the script runs.
I am using Cloudera Enterprise Data Hub Edition Trial 5.6.0 (#54 built by jenkins on 20160211-1910 git: 1c2be84380aa23bd5d6993ec300e144c78b37bf2) .
> 2016-04-09 06:37:06,229 [uber-SubtaskRunner] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
> 2016-04-09 06:37:06,237 [uber-SubtaskRunner] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]
> Oozie Launcher failed, finishing Hadoop job gracefully
> Oozie Launcher, uploading action data to HDFS sequence file: hdfs://node.xxxx.com:8020/user/admin/oozie-oozi/0000000-160409060732867-oozie-oozi-W/pig--pig/action-data.seq
EDIT.
Additional log output, obtained with the oozie command-line tool as follows:
oozie job -log 0000001-160409063446097-oozie-oozi-W -oozie http://xxxnode:11000/oozie
This gives only the following:
63446097-oozie-oozi-W] ACTION[0000001-160409063446097-oozie-oozi-W#FirstJob] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]