Error while executing ForEach - Apache Pig
I have three logs: a Squid log, a login log, and a logoff log. I need to cross-reference them to find out which sites each user has accessed.
I'm using Apache Pig and wrote the following script to do it:
copyFromLocal /home/marcelo/Documentos/hadoop/squid.txt /tmp/squid.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_in /tmp/login.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_out /tmp/logout.txt;
squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
nsquid = FOREACH squid GENERATE FLATTEN (STRSPLIT(linha,'[ ]+'));
nsquid = FOREACH nsquid GENERATE $0 AS timeStamp:chararray, $2 AS ipCliente:chararray, $5 AS request:chararray, $6 AS url:chararray;
nsquid = FOREACH nsquid GENERATE FLATTEN (STRSPLIT(timeStamp,'[.]'))AS (timeStamp:int,resto:chararray),ipCliente,request,url;
nsquid = FOREACH nsquid GENERATE (int)$0 AS timeStamp:int, $2 AS ipCliente:chararray,$3 AS request:chararray, $4 AS url:chararray;
connect = FILTER nsquid BY (request=='CONNECT');
login = LOAD '/tmp/login.txt' USING PigStorage(' ') AS (serverAL: chararray, data: chararray, hora: chararray, netlogon: chararray, on: chararray, ip: chararray);
nlogin = FOREACH login GENERATE FLATTEN(STRSPLIT(serverAL,'[\\\\]')),data, hora,FLATTEN(STRSPLIT(ip,'[\\\\]'));
nlogin = FOREACH nlogin GENERATE $1 AS al:chararray, $2 AS data:chararray, $3 AS hora:chararray, $4 AS ipCliente:chararray;
logout = LOAD '/tmp/logout.txt' USING PigStorage(' ') AS (data: chararray, hora: chararray, logout: chararray, ipAl: chararray, disconec: chararray);
nlogout = FOREACH logout GENERATE data, hora, FLATTEN(STRSPLIT(ipAl,'[\\\\]'));
nlogout = FOREACH nlogout GENERATE $0 AS data:chararray,$1 AS hora:chararray,$2 AS ipCliente:chararray, $3 AS al:chararray;
data = JOIN nlogin BY (al,ipCliente,data), nlogout BY (al,ipCliente,data);
ndata = FOREACH data GENERATE nlogin::al,ToUnixTime(ToDate(CONCAT(nlogin::data, nlogin::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogin:int,ToUnixTime(ToDate(CONCAT(nlogout::data, nlogout::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogout:int,nlogout::ipCliente;
BB = FOREACH ndata GENERATE $0 AS al:chararray, (int)$1 AS tslogin:int, (int)$2 AS tslogout:int, $3 AS ipCliente:chararray;
CC = JOIN BB BY ipCliente, connect BY ipCliente;
DD = FOREACH CC GENERATE BB::al AS al:chararray, (int)BB::tslogin AS tslogin:int, (int)BB::tslogout AS tslogout:int,(int)connect::timeStamp AS timeStamp:int, connect::ipCliente AS ipCliente:chararray, connect::url AS url:chararray;
EE = FILTER DD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
STORE EE INTO 'EEs';
But it returns the following error:
2015-10-16 21:58:10,600 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201510162141_0008 has failed! Stop running all dependent jobs
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Error while executing ForEach at [DD[93,5]]
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-10-16 21:58:10,667 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.1 0.12.1 root 2015-10-16 21:56:48 2015-10-16 21:58:10 HASH_JOIN,FILTER
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201510162141_0007 2 1 4 3 4 4 9 9 9 9 BB,data,login,logout,ndata,nlogin,nlogout HASH_JOIN
Failed Jobs:
JobId Alias Feature Message Outputs
job_201510162141_0008 CC,DD,EE,connect,nsquid,squid HASH_JOIN Message: Job failed! Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201510162141_0008_r_000000 hdfs://localhost:9000/user/root/EEb,
Input(s):
Successfully read 7367 records from: "/tmp/login.txt"
Successfully read 7374 records from: "/tmp/logout.txt"
Failed to read data from "/tmp/squid.txt"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/root/EEb"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201510162141_0007 -> job_201510162141_0008,
job_201510162141_0008
2015-10-16 21:58:10,674 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 11 time(s).
2015-10-16 21:58:10,674 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
I created an alternative that works; I just replaced the penultimate line with:
STORE DD INTO 'DD';
newDD = LOAD 'hdfs://localhost:9000/user/root/DD' USING PigStorage AS (al:chararray, tslogin:int, tslogout:int, timeStamp:int, ipCliente:chararray, url:chararray);
EE = FILTER newDD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
Does anyone have any idea how to fix this without the STORE?
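Editor's note: one hedged possibility, not a verified fix. The ACCESSING_NON_EXISTENT_FIELD warnings suggest that some squid lines split into fewer fields than the positional projections expect, so the runtime cast in DD ends up touching missing values. Keeping the STRSPLIT result as a tuple and filtering out short rows before projecting would avoid referencing positions that do not exist (the aliases parts and whole are illustrative):
squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
-- keep the split result as a single tuple instead of flattening it
parts = FOREACH squid GENERATE STRSPLIT(linha,'[ ]+') AS t;
-- drop lines that did not split into at least 7 fields
whole = FILTER parts BY SIZE(t) > 6;
nsquid = FOREACH whole GENERATE t.$0 AS timeStamp:chararray, t.$2 AS ipCliente:chararray, t.$5 AS request:chararray, t.$6 AS url:chararray;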
Related
Pig script warning: while trying to do any FOREACH I get these warnings
grunt> a = load '/user/horton/flightdelays_clean/part-m-00000' using PigStorage(',');
2016-10-12 15:22:25,593 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> b = group a by $0;
grunt> c = foreach b generate COUNT($0);
2016-10-12 15:22:40,244 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-10-12 15:22:40,248 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s).
You are passing a field as an argument to a function that expects one type, but the field is of another type. Try this:
grunt> b = group a by $0;
grunt> c = foreach b generate COUNT(a);
c = foreach b generate COUNT($0);
should be
c = foreach b generate COUNT(a);
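For context, a minimal sketch of why COUNT(a) works where COUNT($0) does not (the load path is the asker's; the comments are illustrative): after the GROUP, each row of b has the schema (group, a), where a is a bag holding all the grouped tuples. COUNT expects a bag, while $0 in that FOREACH is the group key itself, hence the IMPLICIT_CAST_TO_BAG warning.
a = load '/user/horton/flightdelays_clean/part-m-00000' using PigStorage(',');
b = group a by $0;                       -- each row of b is (group, a:bag)
c = foreach b generate group, COUNT(a);  -- count the tuples in each group's bag
dump c;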
Apache Pig: OutOfMemoryError with FOREACH aggregation
I'm getting an OutOfMemoryError when running a FOREACH operation in Apache Pig.
16/06/24 15:14:17 INFO util.SpillableMemoryManager: first memory handler call- Usage threshold init = 164102144(160256K) used = 556137816(543103K) committed = 698875904(682496K) max = 698875904(682496K)
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 4095"...
Killed
My Pig script:
A = LOAD 'PageCountTest/' USING PigStorage(' ') AS (Project:chararray, Title:chararray, count:int, size:int);
B = GROUP A BY (Project,Title);
C = FOREACH B generate group, SUM(A.count) AS COUNT;
D = ORDER C BY COUNT DESC;
STORE C INTO '/user/hadoop/wikistats';
Sample data:
aa.b Main_Page 1 14335
aa.d India 1 4075
aa.d Main_Page 1 13190
aa.d Special:RecentChanges 1 200
aa.d Talk:Main_Page/ 1 14147
aa.d w/w/index.php 9 137502
aa Main_Page 6 9872
aa Special:Statistics 1 324
Can anyone please help?
I suspect the memory issue arises in the ORDER BY, as it is the heaviest operation. The easiest fix is to raise the heap-size parameters when launching your Pig job:
pig -Dmapred.child.java.opts=-Xmx2096M yourjob.pig
You can also set the heap size of the Pig client JVM via an environment variable before launching:
export PIG_HEAPSIZE=2096
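A hedged aside on the property name: on Hadoop 2 and later, the child JVM options are split per task type, so the equivalent invocation would look like this (the script name is hypothetical):
# Hadoop 2+ splits mapred.child.java.opts per task type
pig -Dmapreduce.map.java.opts=-Xmx2096M \
    -Dmapreduce.reduce.java.opts=-Xmx2096M \
    yourjob.pig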
How to convert fields into bags and tuples in Pig?
I have a dataset with comma-separated values:
10,4,21,9,50,9,4,50
50,78,47,7,4,7,4,50
68,25,43,13,11,68,10,9
I want to convert this into bags and tuples as shown below:
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
({(50),(78),(45),(7),(4)},{(7),(4),(50)})
({(68),(25),(43),(13),(11)},{(68),(10),(9)})
I have tried the command below, but it does not show any data:
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (bag1:bag{t1:tuple(p1:int, p2:int, p3:int, p4:int, p5:int)}, bag2:bag{t2:tuple(p6:int, p7:int, p8:int)});
grunt> dump dataset;
Below is the output of the dump:
2015-09-11 05:26:31,057 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 8 time(s).
2015-09-11 05:26:31,057 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-09-11 05:26:31,058 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-09-11 05:26:31,058 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-09-11 05:26:31,063 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-09-11 05:26:31,063 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(,)
(,)
(,)
(,)
Please help. How can I convert the dataset into bags and tuples?
Got the solution. I used the commands below:
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (p1:int, p2:int, p3:int, p4:int, p5:int, p6:int, p7:int, p8:int);
grunt> dataset2 = Foreach dataset Generate TOBAG(p1, p2, p3, p4, p5) as bag1, TOBAG(p6, p7, p8) as bag2;
grunt> dump dataset2;
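Editor's note on why the original LOAD printed empty tuples: PigStorage(',') yields one scalar field per comma-separated value, and Pig cannot cast those scalars to the declared bag schemas, which is what the FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings report. With the TOBAG approach each output row is a tuple of two bags, so the dump of the sample data above should print the desired form:
grunt> dump dataset2;
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
({(50),(78),(47),(7),(4)},{(7),(4),(50)})
({(68),(25),(43),(13),(11)},{(68),(10),(9)})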
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
While trying to overwrite a Hive managed table into an external table, I am getting the error below.
Query ID = hdfs_20150701054444_576849f9-6b25-4627-b79d-f5defc13c519
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435327542589_0049, Tracking URL = http://ip-XXX.XX.XX.XX ec2.internal:8088/proxy/application_1435327542589_0049/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435327542589_0049
Hadoop job information for Stage-0: number of mappers: 0; number of reducers: 0
2015-07-01 05:44:11,676 Stage-0 map = 0%, reduce = 0%
Ended Job = job_1435327542589_0049 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Pig Latin - not showing the right record numbers
I have written a Pig script for wordcount which works fine; I can see the results in my output directory in HDFS. But towards the end of my console output, I see the following:
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1695568121_0002 1 1 0 0 0 0 0 0 0 0 words_sorted SAMPLER
job_local2103470491_0003 1 1 0 0 0 0 0 0 0 0 words_sorted ORDER_BY /output/result_pig,
job_local696057848_0001 1 1 0 0 0 0 0 0 0 0 book,words,words_agg,words_grouped GROUP_BY,COMBINER
Input(s):
Successfully read 0 records from: "/data/pg5000.txt"
Output(s):
Successfully stored 0 records in: "/output/result_pig"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local696057848_0001 -> job_local1695568121_0002,
job_local1695568121_0002 -> job_local2103470491_0003,
job_local2103470491_0003
2014-07-01 14:10:35,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
As you can see, the job succeeds, but the Input(s) and Output(s) both report 0 records successfully read/stored, and the counter values are all 0. Why are these values zero? They should not be. I am using Hadoop 2.2 and Pig 0.12.
Here is the script:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;
words_grouped = group words by word;
words_agg = foreach words_grouped generate group as word, COUNT(words);
words_sorted = ORDER words_agg BY $1 DESC;
STORE words_sorted into '/output/result_pig' using PigStorage(':','-schema');
NOTE: my data is present in /data/pg5000.txt, not in the default directory, which is /usr/name/data/pg5000.txt.
EDIT: here is the output of printing my file to the console:
hadoop fs -cat /data/pg5000.txt | head -10
The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
cat: Unable to write to output stream.
Please change the following line:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
to
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
I am assuming the delimiter is a comma here; use whichever character separates the records in your file. This should solve the issue.
Also note: if no argument is provided, PigStorage assumes tab-delimited input. If a delimiter argument is provided, it must be a single-byte character; any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter.
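A hedged alternative for the wordcount case specifically (an editor's sketch, not part of the original answer): if the goal is one whole line per record, TextLoader avoids delimiter splitting entirely, since it loads each line as a single chararray:
-- load whole lines without any field splitting, then tokenize
book = load '/data/pg5000.txt' using TextLoader() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;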