I know this is one of the most frequently asked questions, but I have looked almost everywhere and none of the resources resolved the issue I am facing.
Below is a simplified version of my problem. The actual data is a little more complex, so I have to use a UDF.
My input File: (input.txt)
NotNeeded1,NotNeeded11;Needed1
NotNeeded2,NotNeeded22;Needed2
I want the output to be
Needed1
Needed2
So I wrote the UDF below
(Java code):
package com.company.pig;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class myudf extends EvalFunc<String>{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
String s = (String)input.get(0);
String str = s.split("\\,")[1];
String str1 = str.split("\\;")[1];
return str1;
}
}
And packaging it into
rollupreg_extract-jar-with-dependencies.jar
Below is my pig shell code
grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
grunt> DEFINE myudf com.company.pig.myudf;
grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(',');
grunt> extract = FOREACH data GENERATE myudf($1);
grunt> DUMP extract;
And I get the below error:
2017-05-15 15:58:15,493 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-05-15 15:58:15,577 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-05-15 15:58:15,659 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-05-15 15:58:15,774 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2017-05-15 15:58:15,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-05-15 15:58:16,184 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:16,196 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:16,396 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:16,576 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-05-15 15:58:16,580 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:16,584 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-05-15 15:58:16,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2017-05-15 15:58:17,258 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar
2017-05-15 15:58:17,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2017-05-15 15:58:17,294 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2017-05-15 15:58:17,354 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2017-05-15 15:58:17,510 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:17,753 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-05-15 15:58:17,820 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-05-15 15:58:17,884 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2017-05-15 15:58:17,889 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
2017-05-15 15:58:17,922 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2017-05-15 15:58:18,525 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2017-05-15 15:58:18,692 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023
2017-05-15 15:58:18,879 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2017-05-15 15:58:18,973 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023
2017-05-15 15:58:19,029 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C: R:
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023]
2017-05-15 15:58:29,156 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-05-15 15:58:29,156 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs
2017-05-15 15:58:29,157 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-05-15 15:58:29,790 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:29,791 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:29,793 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,311 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:30,312 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:30,313 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-05-15 15:58:30,467 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.3.2.5.0.0-1245 root 2017-05-15 15:58:16 2017-05-15 15:58:30 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1494853652295_0023 data,extract MAP_ONLY Message: Job failed! hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225,
Input(s):
Failed to read data from "/pig_hdfs/input.txt"
Output(s):
Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1494853652295_0023
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract
Details at logfile: /pig/pig_1494863836458.log
I know it complains that
Failed to read data from "/pig_hdfs/input.txt"
But I am sure this is not the actual issue: if I skip the UDF and dump the data directly, I get the output.
First, you do not need a UDF to get the desired output. You can use a semicolon as the delimiter in the LOAD statement and take the second column.
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';');
extract = FOREACH data GENERATE $1;
DUMP extract;
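If you insist on using a UDF, you will have to load the whole record into a single field and then apply the UDF to it. Also, your UDF is incorrect. With PigStorage(',') the field $1 is 'NotNeeded11;Needed1', which contains no comma, so s.split("\\,")[1] has no second element. Split the string s that is passed in from the Pig script on ';' instead.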
String s = (String)input.get(0);
String str1 = s.split("\\;")[1];
And in your Pig script, load the entire record into one field and call the UDF on field $0.
REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
DEFINE myudf com.company.pig.myudf;
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
extract = FOREACH data GENERATE myudf($0);
DUMP extract;
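As a rough sketch of the extraction logic only (kept free of Pig classes so it can be run standalone; the class name FieldExtractor and the method afterSemicolon are made up for illustration), the body of exec could guard against records that have no ';' instead of indexing blindly into the split result:

```java
public class FieldExtractor {
    // Hypothetical helper: returns the text after the first ';',
    // or null when the record is null or contains no ';'.
    public static String afterSemicolon(String record) {
        if (record == null) {
            return null;
        }
        // A limit of 2 keeps everything after the first ';' in one piece.
        String[] parts = record.split(";", 2);
        return parts.length == 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(afterSemicolon("NotNeeded1,NotNeeded11;Needed1"));
        System.out.println(afterSemicolon("NotNeeded2,NotNeeded22;Needed2"));
        System.out.println(afterSemicolon("NoDelimiterHere"));
    }
}
```

In the actual UDF this would be called with the value of input.get(0). That value is a String only when the field is declared as chararray, as in the LOAD statement above; without a declared schema Pig hands over a bytearray, so the cast to String would likely fail.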
I have the following nextflow script:
echo true
wd = "$params.wd"
geoid = "$params.geoid"
process step1 {
publishDir = "$wd/data/"
input:
val celFiles from "$wd/data/$geoid"
output:
file "${geoid}_datFiles.RData" into channel
"""
Rscript $wd/scripts/step1.R $celFiles $wd/data/${geoid}_datFiles.RData
"""
}
The Rscript contains the following commands:
step1=function(WD,
celFiles,
output) {
library(affy)
datFiles=ReadAffy(celfile.path=paste0(WD,"/",celFiles))
save(datFiles,file=output)
}
args=commandArgs(trailingOnly=TRUE)
WD=args[1]
celFiles=args[2]
output=args[3]
step1(WD,celFiles,output)
When it runs, the output file is saved in the directory I want ($wd/data/${geoid}_datFiles.RData). Given that publishDir points to the same directory, I would expect output (defined as "${geoid}_datFiles.RData") to be available under the publishDir directory.
However, I get the following error:
Missing output file(s) `GSE4290_datFiles.RData` expected by process `step1`
The log file suggests that nextflow is still looking for the output in the workflow created directory:
Process `step1` is unable to find [UnixPath]: `/Users/rebeccaeliscu/Desktop/workflow/affymetrix/nextflow/work/92/42afb131a36eb32ed780bd1bf3bc3b/GSE4290_datFiles.RData`
The complete log file:
Nov-12 17:55:39.611 [main] DEBUG nextflow.cli.Launcher - $> nextflow run main.nf
Nov-12 17:55:39.945 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 20.07.1
Nov-12 17:55:39.968 [main] INFO nextflow.cli.CmdRun - Launching `main.nf` [infallible_brahmagupta] - revision: d68e496ea0
Nov-12 17:55:40.026 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /Users/rebeccaeliscu/Desktop/workflow/affymetrix/nextflow/nextflow.config
Nov-12 17:55:40.029 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /Users/rebeccaeliscu/Desktop/workflow/affymetrix/nextflow/nextflow.config
Nov-12 17:55:40.140 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Nov-12 17:55:41.288 [main] DEBUG nextflow.Session - Session uuid: 94f22a74-2a63-4a87-9fb3-33cf925a5a74
Nov-12 17:55:41.288 [main] DEBUG nextflow.Session - Run name: infallible_brahmagupta
Nov-12 17:55:41.289 [main] DEBUG nextflow.Session - Executor pool size: 4
Nov-12 17:55:41.326 [main] DEBUG nextflow.cli.CmdRun -
Version: 20.07.1 build 5412
Created: 24-07-2020 15:18 UTC (08:18 PDT)
System: Mac OS X 10.15.7
Runtime: Groovy 2.5.11 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14
Encoding: UTF-8 (UTF-8)
Process: 46458#Rebeccas-MacBook-Pro-6.local.ucsf.edu [10.49.41.197]
CPUs: 4 - Mem: 8 GB (708.4 MB) - Swap: 2 GB (927 MB)
Nov-12 17:55:41.353 [main] DEBUG nextflow.Session - Work-dir: /Users/rebeccaeliscu/Desktop/workflow/affymetrix/nextflow/work [Mac OS X]
Nov-12 17:55:41.354 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /Users/rebeccaeliscu/Desktop/workflow/affymetrix/nextflow/bin
Nov-12 17:55:41.594 [main] DEBUG nextflow.Session - Observer factory: TowerFactory
Nov-12 17:55:41.598 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Nov-12 17:55:41.911 [main] DEBUG nextflow.Session - Session start invoked
Nov-12 17:55:42.309 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Nov-12 17:55:42.331 [main] DEBUG nextflow.Session - Workflow process names [dsl1]: step1
Nov-12 17:55:42.334 [main] WARN nextflow.script.BaseScript - The use of `echo` method has been deprecated
Nov-12 17:55:42.495 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: null
Nov-12 17:55:42.496 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'local'
Nov-12 17:55:42.508 [main] DEBUG nextflow.executor.Executor - [warm up] executor > local
Nov-12 17:55:42.521 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=4; memory=8 GB; capacity=4; pollInterval=100ms; dumpInterval=5m
Your output declaration is looking for a file in the current workDir: "${geoid}_datFiles.RData", but your Rscript is writing to: $wd/data/${geoid}_datFiles.RData. If you change your command to:
Rscript $wd/scripts/step1.R $celFiles ${geoid}_datFiles.RData
Then Nextflow should be able to find the output file. The publishDir directive will then 'publish' it to the defined publishDir.
My desktop application log4j.properties file is:
## Log levels
## TRACE < DEBUG < INFO < WARN < ERROR < FATAL
log4j.rootLogger=INFO
#
## Appender Configuration
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
#
## Pattern to output the caller's file name and line number
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{${datestamp}} %-5p %c{1}:%L - %m%n
And I run this application using java -jar appName.jar > <path-to-log-dir>/logFile.log.
The output for this file is, for instance:
0 [main] INFO br.com.mentium.hrm.agent.Agent - Thread started at: Wed Nov 30 09:53:03 BRST 2016
3 [main] INFO br.com.mentium.hrm.agent.Agent - HRM Agent
3 [main] INFO br.com.mentium.hrm.agent.Agent -
3 [main] INFO br.com.mentium.hrm.agent.Agent - Polling server every 1 minute(s).
3 [main] INFO br.com.mentium.hrm.agent.Agent -
4 [main] INFO br.com.mentium.hrm.agent.Agent - ######################
4 [main] INFO br.com.mentium.hrm.agent.Agent -
5 [main] INFO br.com.mentium.hrm.agent.Agent - Execution at Wed Nov 30 09:53:03 BRST 2016
5 [main] INFO br.com.mentium.hrm.agent.Agent - Iteration number: 1
5 [main] INFO br.com.mentium.hrm.agent.Agent -
I guess the first number on each line is the time in milliseconds since the application was started.
I'd like to format the log's output as:
yyyy-MM-dd hh:mm:sss abbreviatedClassName (ie, b.c.m.h.a.ClassName) - message
I know I need to do it on the ConversionPattern line, but no changes I make to it seem to take effect.
What's wrong here?
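You need to spell out the date format inside %d{...}, as in the line below; ${datestamp} is not defined anywhere in the file you posted. This does not match your requirement exactly, but you can adapt it from there: HH is the 24-hour field and SSS is milliseconds. Also note that log4j.rootLogger=INFO attaches no appender to the root logger; for the CONSOLE pattern to be used, that line should read log4j.rootLogger=INFO, CONSOLE.
You can read more in the log4j PatternLayout documentation.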
log4j.appender.CONSOLE.layout.ConversionPattern=%d{dd MMM yyyy HH:mm:ss,SSS} %-5p %c{1}:%L - %m%n
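log4j 1.x hands the text inside %d{...} to java.text.SimpleDateFormat, so a candidate date pattern can be tried in plain Java before editing the properties file. The sketch below does that for the layout asked for in the question and also shows one way to abbreviate package names. The class and method names (PatternPreview, formatUtc, abbreviate) are illustrative only and not part of log4j; as far as I know, %c{n} in log4j 1.x only keeps the last n name components and does not abbreviate them.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class PatternPreview {
    // Formats a timestamp with the given SimpleDateFormat pattern in UTC.
    public static String formatUtc(String pattern, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    // Shortens every package segment to its first letter and keeps
    // the simple class name unchanged.
    public static String abbreviate(String className) {
        String[] parts = className.split("\\.");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length - 1; i++) {
            if (!parts[i].isEmpty()) {
                out.append(parts[i].charAt(0)).append('.');
            }
        }
        return out.append(parts[parts.length - 1]).toString();
    }

    public static void main(String[] args) {
        // Epoch 0 is used only to get a fixed, reproducible timestamp.
        System.out.println(formatUtc("yyyy-MM-dd HH:mm:ss,SSS", 0L)
                + " " + abbreviate("br.com.mentium.hrm.agent.Agent")
                + " - message");
    }
}
```

Once the date part looks right here, the same pattern text can be placed between the braces of %d{...} in the ConversionPattern line.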
grunt> table_load = load 'test_table_one' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> dump table_load;
2016-10-05 17:25:43,798 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-10-05 17:25:43,930 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://localhost:9084
2016-10-05 17:25:43,931 [main] INFO hive.metastore - Opened a connection to metastore, current connections: 1
2016-10-05 17:25:43,934 [main] INFO hive.metastore - Connected to metastore.
…
2016-10-05 17:25:58,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1475669003352_0017
2016-10-05 17:25:58,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases table_load
2016-10-05 17:25:58,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: table_load[7,13] C: R:
2016-10-05 17:25:58,716 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-10-05 17:25:58,716 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1475669003352_0017]
2016-10-05 17:26:13,753 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-10-05 17:26:13,753 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1475669003352_0017 has failed! Stop running all dependent jobs
2016-10-05 17:26:13,753 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-10-05 17:26:13,882 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2016-10-05 17:26:13,883 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0 0.15.0 hadoop 2016-10-05 17:25:57 2016-10-05 17:26:13 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1475669003352_0017 table_load MAP_ONLY Message: Job failed! hdfs://mycluster/tmp/temp81690062/tmp2002161033,
Input(s):
Failed to read data from "test_table_one"
Output(s):
Failed to produce result in "hdfs://mycluster/tmp/temp81690062/tmp2002161033"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1475669003352_0017
2016-10-05 17:26:13,883 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-10-05 17:26:13,889 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias table_load
Details at logfile: /home/hadoop/pig_1475674706670.log
Can you help me find out why this is happening?
Either start Pig with pig -useHCatalog, or start plain pig and REGISTER the supporting JARs that HCatalog needs to work in grunt.
You can find the required JARs among those that are shipped to HDFS when you run pig -useHCatalog.
grunt> table_load = load 'test_table_one' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> dump table_load;
The reason may be that you have not created a Hive table with that exact name. Check that the table exists in Hive and check its schema.
Before using HCatalog, you have to create the table schema on top of the location from which you are loading the data. Specify a queue name if required. Before executing, check for the table in Hive.
Hope this helps.
I have installed a CDH 5.3 cluster on Ubuntu, following all the configurations recommended by Cloudera; it runs Hadoop and HBase.
The problem arises when I try to load data and dump it using Pig: the job stalls and always stays at 0%.
OS: Ubuntu 14.04 64
Parcel CDH 5.3 (or 5.5.1)
Job : a = load '/user/nadir/data.txt' ; dump a ;
logs:
2016-02-12 04:06:33.869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1455246282704_0001
2016-02-12 04:06:33.869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a
2016-02-12 04:06:33.869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[1,4] C: R:
2016-02-12 04:06:34.121 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
I am running the Pig example from DataStax: http://www.datastax.com/docs/datastax_enterprise3.1/solutions/about_pig#pig-read-write. I am using DataStax Enterprise 3.1.2. But when I try to store the data back into Cassandra with:
grunt> STORE insertformat INTO
'cql://cql3ks/test?output_query=UPDATE+cql3ks.test+set+b+%3D+%3F'
USING CqlStorage;
I get the following output:
2014-03-11 10:14:38,383 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-03-11 10:14:38,440 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-03-11 10:14:38,442 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-03-11 10:14:38,442 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-03-11 10:14:38,451 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-03-11 10:14:38,452 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-03-11 10:14:38,452 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job1332293282461754849.jar
2014-03-11 10:14:40,560 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job1332293282461754849.jar created
2014-03-11 10:14:40,569 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-03-11 10:14:40,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-03-11 10:14:41,111 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-03-11 10:14:43,934 [Thread-10] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-03-11 10:14:45,547 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201403091619_0036
2014-03-11 10:14:45,547 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://127.0.0.1:50030/jobdetails.jsp?jobid=job_201403091619_0036
2014-03-11 10:17:52,330 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201403091619_0036 has failed! Stop running all dependent jobs
2014-03-11 10:17:52,330 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-03-11 10:17:52,334 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: InvalidRequestException(why:Expected 4 or 0 byte int (11))
2014-03-11 10:17:52,335 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2014-03-11 10:17:52,335 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.4.8 0.9.2 root 2014-03-11 10:14:38 2014-03-11 10:17:52 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201403091619_0036 insertformat,moretestvalues MAP_ONLY Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201403091619_0036_m_000000 cql://cql3ks/test?output_query=UPDATE+cql3ks.test+set+b+%3D+%3F,
Input(s):
Failed to read data from "cql://cql3ks/moredata/"
Output(s):
Failed to produce result in "cql://cql3ks/test?output_query=UPDATE+cql3ks.test+set+b+%3D+%3F"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201403091619_0036
2014-03-11 10:17:52,335 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
The Log-File is:
Backend error message
---------------------
java.io.IOException: InvalidRequestException(why:Expected 4 or 0 byte int (11))
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:248)
Caused by: InvalidRequestException(why:Expected 4 or 0 byte int (11))
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:41868)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1689)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1674)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:232)
What am I doing wrong? To me it looks like a bug, because when I use strings instead of integers in CQL when creating the table, the example works fine.
Thank you
I just tested it with a fresh install of DSE 3.1.2, and it works for me. You may need to re-install DSE and re-create the tables to test it again.
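One possible reading of the error text, offered as a guess rather than a confirmed diagnosis: the number in parentheses in "Expected 4 or 0 byte int (11)" looks like the byte length of the value that was bound to an int column, which would fit a value arriving as text instead of as a 4-byte integer. The plain-Java sketch below only illustrates that size difference; the sample values and the class name IntWidthDemo are invented, and it uses no Cassandra or Pig API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class IntWidthDemo {
    // Serialized size of a 32-bit integer written through a ByteBuffer.
    public static int intBytes(int value) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
        buf.putInt(value);
        buf.flip();
        return buf.remaining();
    }

    // Size of a value when it is sent as UTF-8 text instead.
    public static int textBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(intBytes(12345));          // 4
        System.out.println(textBytes("12345678901")); // 11
    }
}
```

If that reading is right, it would be worth checking in Pig (for example with DESCRIBE) that the field being bound to the int column really has type int before the STORE.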