I am following this guide below to install Apache Giraph.
http://giraph.apache.org/quick_start.html
I have followed all the steps correctly but when I try to run the wordcount example given in the link, the program gets stuck at
16/08/13 14:49:29 INFO input.FileInputFormat: Total input paths to process : 1
16/08/13 14:49:30 INFO mapred.JobClient: Running job: job_201608131448_0001
16/08/13 14:49:31 INFO mapred.JobClient: map 0% reduce 0%
16/08/13 14:49:45 INFO mapred.JobClient: map 100% reduce 0%
The reduce job doesn't start. How can I fix this?
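As a diagnostic aside, a quick way to see whether the reduce task was ever scheduled is to query the job and check the JobTracker web UI. This is only a sketch for the single-node setup from the quick-start guide, using the job ID from the output above:
# Check the job's status and failure counters (MRv1 CLI on the quick-start's Hadoop version).
hadoop job -status job_201608131448_0001
# List the jobs the JobTracker currently knows about.
hadoop job -list
# The JobTracker web UI (http://localhost:50030/ by default on Hadoop 0.20.x) shows whether
# the reduce task is pending, scheduled, or stuck copying map output.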
I am running my Hive query on an EMR cluster of 25 nodes, using r4.4xlarge instances.
When I run my query, I get the error below.
Job Commit failed with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: FEAF40B78D086BEE; S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=), S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=)'
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/var/lib/hadoop/steps/s-10YQZ5Z5PRUVJ/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-east-1.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f" "s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-22_App/d77a6a82-26f4-4f06-a1ea-e83677256a55/01/DeltaOutPut/processing/Scripts/script.sql" (RuntimeError)
Command exiting with ret '1'
I have tried setting all kinds of Hive parameter combinations, like the ones below (a sketch of how such settings are applied follows the list).
emrfs-site fs.s3.consistent.retryPolicyType exponential
emrfs-site fs.s3.consistent.metadata.tableName EmrFSMetadataAlt
emrfs-site fs.s3.consistent.metadata.write.capacity 300
emrfs-site fs.s3.consistent.metadata.read.capacity 600
emrfs-site fs.s3.consistent true
hive-site hive.exec.stagingdir .hive-staging
hive-site hive.tez.java.opts -Xmx47364m
hive-site hive.stats.fetch.column.stats true
hive-site hive.stats.fetch.partition.stats true
hive-site hive.vectorized.execution.enabled false
hive-site hive.vectorized.execution.reduce.enabled false
hive-site tez.am.resource.memory.mb 15000
hive-site hive.auto.convert.join false
hive-site hive.compute.query.using.stats true
hive-site hive.cbo.enable true
hive-site tez.task.resource.memory.mb 16000
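For reference, settings like these are usually applied as EMR configuration classifications when the cluster is created. A minimal sketch of that layout follows; the cluster name, release label, and file names are placeholders, and only a subset of the properties above is shown:
# Hypothetical configurations file mirroring a few of the settings listed above.
cat > hive-config.json <<'EOF'
[
  { "Classification": "emrfs-site",
    "Properties": { "fs.s3.consistent": "true",
                    "fs.s3.consistent.metadata.read.capacity": "600" } },
  { "Classification": "hive-site",
    "Properties": { "hive.vectorized.execution.enabled": "false",
                    "tez.am.resource.memory.mb": "15000" } }
]
EOF
# Placeholder cluster creation call; instance counts and types are illustrative.
aws emr create-cluster \
  --name "hive-delta-qa" \
  --release-label emr-5.20.0 \
  --applications Name=Hive Name=Tez \
  --instance-type r4.4xlarge --instance-count 25 \
  --configurations file://./hive-config.json \
  --use-default-roles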
But every time it failed.
I tried increasing the number of nodes / using bigger instances in the EMR cluster, but the result is still the same.
I also tried with and without Tez, but it still did not work for me.
Here is my sample query; I am only copying part of it.
insert into filediffPcfp.TableDelta
Select rgt.FILLER1,rgt.DUNSNUMBER,rgt.BUSINESSNAME,rgt.TRADESTYLENAME,rgt.REGISTEREDADDRESSINDICATOR
Please help me identify the issue.
Adding the full YARN logs:
2019-02-26 06:28:54,318 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to temp file: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_task_tmp.-ext-10000/_tmp.000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: New Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,681 [INFO] [TezChild] |exec.FileSinkOperator|: FS[11]: records written - 1
2019-02-26 06:28:54,877 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000
2019-02-26 06:28:56,632 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 10000
2019-02-26 06:29:13,301 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 100000
2019-02-26 06:31:59,207 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000000
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1551161362408_0001_7_01_000000_1 due to an invocation of shutdownRequested
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Received abort
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Forwarding abort to RecordProcessor
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.MapRecordProcessor|: Forwarding abort to mapOp: {} MAP
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |exec.MapOperator|: Received abort in operator: MAP
2019-02-26 06:34:42,705 [INFO] [TezChild] |s3.S3FSInputStream|: Encountered exception while reading '2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/IncrFile/WB.ACTIVE.OCT17_01_OF_10.gz', will retry by attempting to reopen stream.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at com.amazon.ws.emr.hadoop.fs.s3n.InputStreamWithInfo.read(InputStreamWithInfo.java:173)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:136)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:255)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:151)
at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
Switch from Tez to MR; it should start working. Also remove all the Tez-related properties.
set hive.execution.engine=mr;
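If you don't want to edit hive-site.xml, one way to apply this per run is to override the engine on the command line. A minimal sketch, where script.sql is a local placeholder for the script from the question:
# Force the MapReduce engine for this run only, overriding hive.execution.engine.
hive --hiveconf hive.execution.engine=mr -f script.sql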
Let me answer my own question.
The first, very important thing we have noticed while running Hive jobs on EMR is that the STEP error is misleading; "vertex failed" will not point you in the right direction.
So it is better to check the Hive logs.
Now, if the instance is terminated, we cannot log in to the master instance and read the logs there; in that case we have to look at the node application logs.
Here is how we can find those node logs.
Get the master instance ID (something like i-04d04d9a8f7d28fd1) and search for it among the node logs.
Then open the path below:
/applications/hive/user/hive/hive.log.gz
Here you can find the actual error.
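Assuming the cluster was created with an S3 log URI, those node logs end up under <log-uri>/<cluster-id>/node/<instance-id>/. A sketch of pulling hive.log.gz with the AWS CLI; the bucket, cluster ID, and instance ID are placeholders:
LOG_URI=s3://my-emr-logs            # placeholder: the cluster's configured log URI
CLUSTER_ID=j-XXXXXXXXXXXXX          # placeholder cluster ID
NODE_ID=i-04d04d9a8f7d28fd1         # master (or core) instance ID
# Browse the node's application logs, then stream and read hive.log.gz.
aws s3 ls "${LOG_URI}/${CLUSTER_ID}/node/${NODE_ID}/applications/hive/user/hive/"
aws s3 cp "${LOG_URI}/${CLUSTER_ID}/node/${NODE_ID}/applications/hive/user/hive/hive.log.gz" - | zcat | less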
We also have to look at the container logs for the failed nodes; the failed node details can be found on the master instance.
hadooplogs/j-25RSD7FFOL5JB/node/i-03f8a646a7ae97aae/daemons/
These daemon logs can be found only while the cluster is running; after the cluster is terminated, EMR does not push them to the S3 log URI.
When I looked at them, I found the real reason why it was failing.
For me, this was the reason for the failure:
On checking the master instance's instance-controller logs, I saw that multiple core instances had gone into an unhealthy state:
2019-02-27 07:50:03,905 INFO Poller: InstanceJointStatusMap contains 21 entries (R:21):
i-0131b7a6abd0fb8e7 1541s R 1500s ig-28 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com I: 18s Y:U 81s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
i-01672279d170dafd3 1539s R 1500s ig-28 ip-10-97-54-69.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 0.7%
i-0227ac0f0932bd0b3 1539s R 1500s ig-28 ip-10-97-51-197.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 4.1%
i-02355f335c190be40 1544s R 1500s ig-28 ip-10-97-52-150.tr-fr-nonprod.aws-int.thomsonreuters.com I: 22s Y:R 84s c: 0 am:241664 H:R 0.2%
i-024ed22b6affdd5ec 1540s R 1500s ig-28 ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:U 79s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
Also, after some time, YARN blacklisted the core instances:
2019-02-27 07:46:39,676 INFO Poller: Determining health status for App Monitor: aws157.instancecontroller.apphealth.monitor.YarnMonitor
2019-02-27 07:46:39,688 INFO Poller: SlaveRecord i-0ac26bd7886fec338 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: SlaveRecord i-0131b7a6abd0fb8e7 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: Update SlaveRecordDbRow for i-0131b7a6abd0fb8e7 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com
2019-02-27 07:47:13,696 INFO Poller: SlaveRecord i-024ed22b6affdd5ec changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,696 INFO Poller: Update SlaveRecordDbRow for i-024ed22b6affdd5ec ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com
On checking the instance nodes' instance-controller logs, I could see that /mnt had filled up due to job caching, with usage going beyond the threshold (90% by default). Because of this, YARN marked the nodes unhealthy (a quick way to check this is sketched after the log excerpt below):
2019-02-27 07:40:52,231 INFO dsm-1: /mnt total 27633 MB free 2068 MB used 25565 MB
2019-02-27 07:40:52,231 INFO dsm-1: / total 100663 MB free 97932 MB used 2731 MB
2019-02-27 07:40:52,231 INFO dsm-1: cycle 17 /mnt/var/log freeSpaceMb: 2068/27633 MB freeRatio:0.07
2019-02-27 07:40:52,248 INFO dsm-1: /mnt/var/log stats :
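To confirm this on a live node, disk usage can be checked directly over SSH. A quick sketch, using the paths that show up in the logs above:
# On the core instance: how full is the EBS-backed /mnt volume compared to the root volume?
df -h /mnt /
# The usual suspects from the logs above when /mnt fills up during a job.
sudo du -sh /mnt/yarn /mnt/var/log 2>/dev/null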
-> In my dataset, the source table uses .gz compression. Since .gz files are not splittable, each file gets exactly one map task, and because that map task decompresses the file under /mnt, it can also contribute to this issue.
-> Processing a large amount of data in EMR needs some Hive properties to be tuned. Below are a few optimization properties that can be set on the cluster to make the query run better.
Very, very important:
Increase the EBS volume size for the core instances.
It is important that we increase the EBS volume for each core node, not just for the master, because the EBS volume is where /mnt gets mounted, not the root volume.
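A sketch of how larger EBS volumes can be attached to the core instance group at cluster creation, assuming the AWS CLI instance-group JSON format; the sizes and counts are illustrative, not a recommendation:
# Placeholder instance-group config; the EbsConfiguration on the CORE group is what
# backs /mnt, so that is where the extra space has to go.
cat > instance-groups.json <<'EOF'
[
  { "InstanceGroupType": "MASTER", "InstanceCount": 1,  "InstanceType": "r4.4xlarge" },
  { "InstanceGroupType": "CORE",   "InstanceCount": 24, "InstanceType": "r4.4xlarge",
    "EbsConfiguration": {
      "EbsBlockDeviceConfigs": [
        { "VolumeSpecification": { "VolumeType": "gp2", "SizeInGB": 500 },
          "VolumesPerInstance": 1 } ] } }
]
EOF
# Passed to cluster creation alongside the usual options.
aws emr create-cluster --name "hive-delta-qa" --release-label emr-5.20.0 \
  --applications Name=Hive Name=Tez \
  --instance-groups file://./instance-groups.json \
  --use-default-roles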
This alone has solved my problem, but the configuration below also helped me optimize the Hive jobs.
hive-site.xml
-------------
"hive.exec.compress.intermediate" : "true",
"hive.intermediate.compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
"hive.intermediate.compression.type" : "BLOCK"
yarn-site.xml
-------------
"max-disk-utilization-per-disk-percentage" : "99"
And this has resolved my issue permanently.
I hope someone will benefit from my answer.
I'm creating an uber-jar Spark application that I'm submitting with spark-submit to an EMR 4.3 cluster. I'm provisioning 4 r3.xlarge instances, one as the master and the other three as cores.
I have Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console.
I'm running the following command:
spark-submit \
--deploy-mode cluster \
--executor-memory 4g \
--executor-cores 2 \
--num-executors 4 \
--driver-memory 4g \
--driver-cores 2 \
--conf "spark.driver.maxResultSize=2g" \
--conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter" \
--conf "spark.shuffle.memoryFraction=0.2" \
--class com.jackar.spark.Main spark-app.jar args
I realise that I'm not fully utilising the cluster, but at this point I'm not exactly trying to tune (or maybe that's what I should be doing?). My underlying job does something along the lines of:
1) Read Parquet files from S3 that represent two datasets, run registerTempTable on the DataFrames, then run cacheTable on each. They are each about 300 MB in memory. (Note: I've tried this using EMR's s3:// protocol as well as s3a://.)
2) Use spark sql to run aggregations (i.e. sums and group bys).
3) Write the results to s3 as parquet files.
The jobs are running just fine when I look at the Spark UI, and they take about as long as I expect them to. The problem is that after the write-agg-to-parquet-in-s3 job finishes (Job tab), there is a period of time where no other jobs get queued up.
If I then go to the SQL tab in the Spark UI, I'll notice there is a "running query" for the same job that the Jobs tab says has completed. When I click in and look at the DAG for that query, I notice that the DAG seems to already be evaluated.
However, this query takes minutes and sometimes causes the entire spark application to restart and eventually fail...
I began to do some investigating to see if I could figure out the issue because on my Databricks trial this job is incredibly quick to execute, the DAG being identical to that on EMR (as expected). But I can't bring myself to justify using Databricks when I have no idea why I'm not seeing similar performance in EMR.
Maybe it's my JVM params? Garbage collection for instance? Time to check the executor logs.
2016-02-23T18:25:48.598+0000: [GC2016-02-23T18:25:48.598+0000: [ParNew: 299156K->30449K(306688K), 0.0255600 secs] 1586767K->1329022K(4160256K), 0.0256610 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
2016-02-23T18:25:50.422+0000: [GC2016-02-23T18:25:50.422+0000: [ParNew: 303089K->32739K(306688K), 0.0263780 secs] 1601662K->1342494K(4160256K), 0.0264830 secs] [Times: user=0.07 sys=0.01, real=0.02 secs]
2016-02-23T18:25:52.223+0000: [GC2016-02-23T18:25:52.223+0000: [ParNew: 305379K->29373K(306688K), 0.0297360 secs] 1615134K->1348874K(4160256K), 0.0298410 secs] [Times: user=0.08 sys=0.00, real=0.03 secs]
2016-02-23T18:25:54.247+0000: [GC2016-02-23T18:25:54.247+0000: [ParNew: 302013K->28521K(306688K), 0.0220650 secs] 1621514K->1358123K(4160256K), 0.0221690 secs] [Times: user=0.06 sys=0.01, real=0.02 secs]
2016-02-23T18:25:57.994+0000: [GC2016-02-23T18:25:57.994+0000: [ParNew: 301161K->23609K(306688K), 0.0278800 secs] 1630763K->1364319K(4160256K), 0.0279460 secs] [Times: user=0.07 sys=0.01, real=0.03 secs]
Okay. That doesn't look good. ParNew stops the world, and it happens every couple of seconds.
Next step: looking into the Spark UI on Databricks to see if the GC configuration is any different from EMR's. I found something kind of interesting. Databricks sets spark.executor.extraJavaOptions to:
-XX:ReservedCodeCacheSize=256m
-XX:+UseCodeCacheFlushing
-javaagent:/databricks/DatabricksAgent.jar
-XX:+PrintFlagsFinal
-XX:+PrintGCDateStamps
-verbose:gc
-XX:+PrintGCDetails
-XX:+HeapDumpOnOutOfMemoryError
-Ddatabricks.serviceName=spark-executor-1
Hey, I'm no GC expert and I added "learn to tune GC" to my to-do list, but what I see here is more than just GC params for the executors. What does the DatabricksAgent.jar do - does that help? I'm not sure, so I force my Spark job to use the Java options for the executors minus the Databricks-specific stuff:
--conf spark.executor.extraJavaOptions="-XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError"
This doesn't change the "running query" behaviour - it still takes forever - but I do get PSYoungGen instead of ParNew (the frequency is still every couple of seconds though):
2016-02-23T19:40:58.645+0000: [GC [PSYoungGen: 515040K->12789K(996352K)] 1695803K->1193777K(3792896K), 0.0203380 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
2016-02-23T19:57:50.463+0000: [GC [PSYoungGen: 588789K->13391K(977920K)] 1769777K->1196033K(3774464K), 0.0237240 secs] [Times: user=0.04 sys=0.00, real=0.02 secs]
If you've read up to here, I commend you. I know how long of a post this is.
Another symptom I found is that while the query is running, stderr and stdout are at a standstill and no new log lines are added on any executor (including the driver).
i.e.
16/02/23 19:41:23 INFO ContextCleaner: Cleaned shuffle 5
16/02/23 19:57:32 INFO DynamicPartitionWriterContainer: Job job_201602231940_0000 committed.
That same ~17 minute gap is accounted for in the Spark UI as the running query... Any idea what's going on?
In the end this job tends to restart after a few aggs get written to S3 (say 10% of them), and then eventually the spark application fails.
I'm not sure if the issue is related to the fact that EMR runs on YARN while Databricks runs on a Standalone cluster or if it's completely unrelated.
The failure I end up getting after I look into the yarn logs is the following:
java.io.FileNotFoundException: No such file or directory: s3a://bucket_file_stuff/_temporary/0/task_201602232109_0020_m_000000/
Any advice is much appreciated. I'll add notes as I go. Thanks!
I launched two m1.medium nodes on Amazon EC2 to execute my Pig script, but it looks like it failed at the first line (even before MapReduce starts): raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000' USING TextLoader as (line:chararray);
The error message I got:
2015-02-04 02:15:39,804 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-02-04 02:15:39,821 [JobControl] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
2015-02-04 02:15:39,822 [JobControl] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 20
... (omitted)
2015-02-04 02:18:40,955 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-02-04 02:18:40,956 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201502040202_0002 has failed! Stop running all dependent jobs
2015-02-04 02:18:40,956 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-02-04 02:18:40,997 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 1.0.3 0.11.1.1-amzn hadoop 2015-02-04 02:15:32 2015-02-04 02:18:40 GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201502050202_0002 ngroup,raw,triples,tt GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201502050202_0002_m_000022
Input(s):
Failed to read data from "s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I think the code should be fine, since I have previously loaded other data successfully with the same syntax, and the link to s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000 looks valid. I suspect it might be related to some of my EC2 settings, but I'm not sure how to investigate further or narrow down the problem. Does anyone have a clue?
The "Java heap space" error message gives some clues. Your files seem to be quite large (~2 GB). Make sure that you have enough memory for each task runner to read the data.
The problem was solved by changing my nodes from m1.medium to m3.large. Thanks for the good hint from #Nat, who pointed out the error message regarding Java heap space. I'll update with more details later.
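For reference, the other route the answer points at (giving each task runner more heap instead of a bigger instance) might look like the line below on the old MRv1 stack, assuming Pig forwards -D properties into the job configuration; the value and script name are only examples:
# Illustrative only: raise the per-task child heap before rerunning the script.
pig -Dmapred.child.java.opts=-Xmx1024m myscript.pig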
I have a Pig script (not particularly more complex than any others I have built) that, before the job starts, seems to loop on this for a long time:
2013-10-08 10:46:07,655 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:07,659 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:09,168 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:09,168 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:11,381 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:11,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:13,875 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:13,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:16,303 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
It repeats the above for around 4 minutes, when usually this step completes in seconds. I have not been able to identify the cause, other than by removing parts of the script, and the issue does not seem to be caused by any particular part of it. I have other scripts as complex as this one and I have not had this problem. What could be causing the issue?
I can't say for certain without more information, but it appears that Pig is waiting for your cluster's JobTracker to start running the underlying Map/Reduce jobs generated by your script. There are numerous reasons why this could be happening, such as running on a shared cluster which has run out of resources. You'll most likely have to look at your cluster's JobTracker and/or TaskTrackers to know the exact reason.
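A couple of quick ways to check that from the cluster, as a sketch (the port is the Apache default for the MRv1 JobTracker; a managed distribution may remap it):
# Jobs currently known to the JobTracker, with their state.
hadoop job -list
# TaskTrackers currently reporting in (a short or shrinking list suggests missing capacity).
hadoop job -list-active-trackers
# Or browse the JobTracker web UI, typically http://<jobtracker-host>:50030/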
Using Apache Pig version 0.10.1.21 (reported),
CentOS release 6.3 (Final), jdk1.6.0_31 (The Hortonworks Sandbox v1.2 on Virtualbox, with 3.5 GB RAM)
$ cat data.txt
11,11,22
33,34,35
47,0,21
33,6,51
56,6,11
11,25,67
$ cat GrpTest.pig
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
$ pig -x local GrpTest.pig
[Thread-12] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[Thread-13] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#19a9bea3
[Thread-13] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
[Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
The java.lang.OutOfMemoryError: Java heap space error occurs each time I use GROUP or JOIN in a pig script executed in local mode. There is no error when the script is executed in mapreduce mode on HDFS.
Question 1: How come there is an OutOfMemoryError when the data sample is minuscule and local mode is supposed to use fewer resources than mapreduce mode?
Question 2: Is there a solution to run successfully a small pig scripts with GROUP or JOIN in local mode?
Solution: force Pig to allocate less memory for the Java property io.sort.mb.
I set it to 10 MB here and the error disappears. I'm not sure what the best value would be, but at least this allows practicing Pig syntax in local mode.
$ cat GrpTest.pig
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
The reason is you have less memory allocated to Java locally than you do on your Hadoop cluster machines. This is actually a pretty common error in Hadoop. It mostly occurs when you create a really long relation in Pig at any point, and happens because Pig always wants to load an entire relation into memory and doesn't want to lazy load it in any way.
When you do something like GROUP BY where the tuple you're grouping by is non-sparse over many records, you frequently wind up creating single long relations at least temporarily since you're basically taking a whole bunch of individual relations and cramming them all into one single long relation. Either change your code so you don't wind up creating single very long relations at any point (i.e. group by something more sparse), or increase the memory available to Java.
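A sketch of the "increase the memory available to Java" route for local mode, assuming the stock pig launcher script, which reads the heap size in MB from PIG_HEAPSIZE:
# Give the local-mode Pig JVM a 2 GB heap before running the script from the question.
PIG_HEAPSIZE=2048 pig -x local GrpTest.pig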