Error while executing ForEach - Apache PIG

I have three logs: a Squid log, a login log, and a logoff log. I need to cross-reference these logs to find out which sites each user has accessed.
I'm using Apache Pig and created the following script to do it:
-- copy the three logs into HDFS
copyFromLocal /home/marcelo/Documentos/hadoop/squid.txt /tmp/squid.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_in /tmp/login.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_out /tmp/logout.txt;
-- Squid log: split each line on whitespace, keep timestamp, client IP, request and URL
squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
nsquid = FOREACH squid GENERATE FLATTEN (STRSPLIT(linha,'[ ]+'));
nsquid = FOREACH nsquid GENERATE $0 AS timeStamp:chararray, $2 AS ipCliente:chararray, $5 AS request:chararray, $6 AS url:chararray;
-- strip the fractional part of the timestamp and cast it to int
nsquid = FOREACH nsquid GENERATE FLATTEN (STRSPLIT(timeStamp,'[.]')) AS (timeStamp:int, resto:chararray), ipCliente, request, url;
nsquid = FOREACH nsquid GENERATE (int)$0 AS timeStamp:int, $2 AS ipCliente:chararray, $3 AS request:chararray, $4 AS url:chararray;
connect = FILTER nsquid BY (request=='CONNECT');
-- Samba login log: extract user (al) and client IP
login = LOAD '/tmp/login.txt' USING PigStorage(' ') AS (serverAL: chararray, data: chararray, hora: chararray, netlogon: chararray, on: chararray, ip: chararray);
nlogin = FOREACH login GENERATE FLATTEN(STRSPLIT(serverAL,'[\\\\]')), data, hora, FLATTEN(STRSPLIT(ip,'[\\\\]'));
nlogin = FOREACH nlogin GENERATE $1 AS al:chararray, $2 AS data:chararray, $3 AS hora:chararray, $4 AS ipCliente:chararray;
-- Samba logout log: extract client IP and user
logout = LOAD '/tmp/logout.txt' USING PigStorage(' ') AS (data: chararray, hora: chararray, logout: chararray, ipAl: chararray, disconec: chararray);
nlogout = FOREACH logout GENERATE data, hora, FLATTEN(STRSPLIT(ipAl,'[\\\\]'));
nlogout = FOREACH nlogout GENERATE $0 AS data:chararray, $1 AS hora:chararray, $2 AS ipCliente:chararray, $3 AS al:chararray;
-- pair each login with its logout and convert both to Unix timestamps
data = JOIN nlogin BY (al,ipCliente,data), nlogout BY (al,ipCliente,data);
ndata = FOREACH data GENERATE nlogin::al, ToUnixTime(ToDate(CONCAT(nlogin::data, nlogin::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogin:int, ToUnixTime(ToDate(CONCAT(nlogout::data, nlogout::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogout:int, nlogout::ipCliente;
BB = FOREACH ndata GENERATE $0 AS al:chararray, (int)$1 AS tslogin:int, (int)$2 AS tslogout:int, $3 AS ipCliente:chararray;
-- join sessions with Squid CONNECT requests and keep hits inside each session window
CC = JOIN BB BY ipCliente, connect BY ipCliente;
DD = FOREACH CC GENERATE BB::al AS al:chararray, (int)BB::tslogin AS tslogin:int, (int)BB::tslogout AS tslogout:int, (int)connect::timeStamp AS timeStamp:int, connect::ipCliente AS ipCliente:chararray, connect::url AS url:chararray;
EE = FILTER DD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
STORE EE INTO 'EEs';
But it is returning the following error:
2015-10-16 21:58:10,600 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201510162141_0008 has failed! Stop running all dependent jobs
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Error while executing ForEach at [DD[93,5]]
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-10-16 21:58:10,667 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.1 0.12.1 root 2015-10-16 21:56:48 2015-10-16 21:58:10 HASH_JOIN,FILTER
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201510162141_0007 2 1 4 3 4 4 9 9 9 9 BB,data,login,logout,ndata,nlogin,nlogout HASH_JOIN
Failed Jobs:
JobId Alias Feature Message Outputs
job_201510162141_0008 CC,DD,EE,connect,nsquid,squid HASH_JOIN Message: Job failed! Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201510162141_0008_r_000000 hdfs://localhost:9000/user/root/EEb,
Input(s):
Successfully read 7367 records from: "/tmp/login.txt"
Successfully read 7374 records from: "/tmp/logout.txt"
Failed to read data from "/tmp/squid.txt"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/root/EEb"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201510162141_0007 -> job_201510162141_0008,
job_201510162141_0008
2015-10-16 21:58:10,674 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 11 time(s).
2015-10-16 21:58:10,674 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
I created an alternative that worked by replacing the penultimate line with:
STORE DD INTO 'DD';
newDD = LOAD 'hdfs://localhost:9000/user/root/DD' USING PigStorage AS (al:chararray, tslogin:int, tslogout:int, timeStamp:int, ipCliente:chararray, url:chararray);
EE = FILTER newDD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
Does anyone have any idea how to fix it without the intermediate STORE?
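A possible direction, judging from the ACCESSING_NON_EXISTENT_FIELD warning in the log: some Squid lines may split into fewer tokens than the script expects, leaving null fields that the casts in DD then choke on. A minimal sketch (an assumption, not a verified fix) that filters such rows out at the connect step, so EE can run without the intermediate STORE:
-- sketch: drop Squid rows whose split produced missing fields, so the casts
-- in DD never see nulls (assumes the warnings come from short/malformed lines)
connect = FILTER nsquid BY (request == 'CONNECT') AND (timeStamp IS NOT NULL) AND (url IS NOT NULL);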

Related

Pig script warning: while trying to do any FOREACH I am getting these warnings

grunt> a = load '/user/horton/flightdelays_clean/part-m-00000' using PigStorage(',');
2016-10-12 15:22:25,593 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> b = group a by $0;
grunt> c = foreach b generate COUNT($0);
2016-10-12 15:22:40,244 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-10-12 15:22:40,248 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s).
You are passing a field as an argument to a function that expects a different type: COUNT expects a bag, but $0 here is the group key, not the bag of grouped records, hence the IMPLICIT_CAST_TO_BAG warning.
Try this:
grunt> b = group a by $0;
grunt> c = foreach b generate COUNT(a);
c = foreach b generate COUNT($0);
Should be
c = foreach b generate COUNT(a);
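Put together, the corrected session reads (path and aliases from the question; the comment explains why COUNT needs the bag):
a = load '/user/horton/flightdelays_clean/part-m-00000' using PigStorage(',');
b = group a by $0;
-- after GROUP, each tuple of b is (group, a), where 'a' is the bag of grouped
-- rows; COUNT expects that bag, while $0 is just the group key
c = foreach b generate COUNT(a);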

Apache Pig: OutOfMemoryError with FOREACH aggregation

I'm getting an OutOfMemoryError when running a FOREACH operation in Apache Pig.
16/06/24 15:14:17 INFO util.SpillableMemoryManager: first memory handler call- Usage threshold init = 164102144(160256K) used = 556137816(543103K) committed = 698875904(682496K) max = 698875904(682496K)
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 4095"... Killed
My Pig script:
A = LOAD 'PageCountTest/' USING PigStorage(' ') AS (Project:chararray, Title:chararray, count:int, size:int);
B = GROUP A BY (Project,Title);
C = FOREACH B GENERATE group, SUM(A.count) AS COUNT;
D = ORDER C BY COUNT DESC;
STORE C INTO '/user/hadoop/wikistats';
Sample Data:
aa.b Main_Page 1 14335
aa.d India 1 4075
aa.d Main_Page 1 13190
aa.d Special:RecentChanges 1 200
aa.d Talk:Main_Page/ 1 14147
aa.d w/w/index.php 9 137502
aa Main_Page 6 9872
aa Special:Statistics 1 324
Can anyone please help?
I suspect the memory issue arises in the ORDER BY, as it is the heaviest operation here. The easiest approach is to raise the heap size parameters when launching your Pig job:
pig -Dmapred.child.java.opts=-Xmx2096M yourjob.pig
You can also set Pig's own heap size via an environment variable before launching the script:
export PIG_HEAPSIZE=2096
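Putting the two together in one launch (a sketch; 2096 MB is simply the value used above, tune it to your cluster):
# raise the Pig client heap (MB) and the MapReduce task JVM heap, then run the job
export PIG_HEAPSIZE=2096
pig -Dmapred.child.java.opts=-Xmx2096M yourjob.pig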

How to convert fields into bags and tuples in PIG?

I have a dataset which has comma-separated values:
10,4,21,9,50,9,4,50
50,78,47,7,4,7,4,50
68,25,43,13,11,68,10,9
I want to convert this into Bags and tuples as shown below:
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
({(50),(78),(47),(7),(4)},{(7),(4),(50)})
({(68),(25),(43),(13),(11)},{(68),(10),(9)})
I have tried the command below, but it does not show any data.
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (bag1:bag{t1:tuple(p1:int, p2:int, p3:int, p4:int, p5:int)}, bag2:bag{t2:tuple(p6:int, p7:int, p8:int)});
grunt> dump dataset;
Below is the output of dump:
2015-09-11 05:26:31,057 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 8 time(s).
2015-09-11 05:26:31,057 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-09-11 05:26:31,058 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-09-11 05:26:31,058 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-09-11 05:26:31,063 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-09-11 05:26:31,063 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(,)
(,)
(,)
(,)
Please help. How can I convert the dataset into bags and tuples?
Got the solution.
I used the commands below:
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (p1:int, p2:int, p3:int, p4:int, p5:int, p6:int, p7:int, p8:int);
grunt> dataset2 = Foreach dataset Generate TOBAG(p1, p2, p3, p4, p5) as bag1, TOBAG(p6, p7, p8) as bag2;
grunt> dump dataset2;
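For reference, with the sample input above, dump dataset2 now produces the desired bags; the first row comes out as:
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
The original attempt printed empty tuples because PigStorage cannot cast a plain integer field into a bag schema, which is exactly what the FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings were reporting.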

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

While I am trying to overwrite a Hive managed table into an external table, I am getting the error below.
Query ID = hdfs_20150701054444_576849f9-6b25-4627-b79d-f5defc13c519
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435327542589_0049, Tracking URL = http://ip-XXX.XX.XX.XX.ec2.internal:8088/proxy/application_1435327542589_0049/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435327542589_0049
Hadoop job information for Stage-0: number of mappers: 0; number of reducers: 0
2015-07-01 05:44:11,676 Stage-0 map = 0%, reduce = 0%
Ended Job = job_1435327542589_0049 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

pig latin - not showing the right record numbers

I have written a Pig script for wordcount which works fine. I can see the results from the Pig script in my output directory in HDFS. But towards the end of my console output, I see the following:
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1695568121_0002 1 1 0 0 0 0 0 0 0 0 words_sorted SAMPLER
job_local2103470491_0003 1 1 0 0 0 0 0 0 0 0 words_sorted ORDER_BY /output/result_pig,
job_local696057848_0001 1 1 0 0 0 0 0 0 0 0 book,words,words_agg,words_grouped GROUP_BY,COMBINER
Input(s):
Successfully read 0 records from: "/data/pg5000.txt"
Output(s):
Successfully stored 0 records in: "/output/result_pig"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local696057848_0001 -> job_local1695568121_0002,
job_local1695568121_0002 -> job_local2103470491_0003,
job_local2103470491_0003
2014-07-01 14:10:35,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
As you can see, the job is a success, but not the Input(s) and Output(s): both say successfully read/stored 0 records, and the counter values are all 0. Why are the values zero? They should not be zero.
I am using Hadoop 2.2 and Pig 0.12.
Here is the script:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;
words_grouped = group words by word;
words_agg = foreach words_grouped generate group as word, COUNT(words);
words_sorted = ORDER words_agg BY $1 DESC;
STORE words_sorted into '/output/result_pig' using PigStorage(':','-schema');
NOTE: my data is present in /data/pg5000.txt and not in the default directory, which is /usr/name/data/pg5000.txt.
EDIT: here is the output of printing my file to the console:
hadoop fs -cat /data/pg5000.txt | head -10
The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
cat: Unable to write to output stream.
Please correct the following line
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
to
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
I am assuming the delimiter is a comma here; use whichever character actually separates the records in your file. This will solve the issue.
Also note --
If no argument is provided, PigStorage will assume tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (eg: 'a', '|'), known escape character (eg: '\t', '\r') is a valid delimiter.
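A quick illustration of those rules (the file paths here are hypothetical):
-- no argument: tab-delimited input
a = LOAD '/data/tab_separated.txt' USING PigStorage() AS (f1:chararray, f2:chararray);
-- single-byte literal delimiter
b = LOAD '/data/comma_separated.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray);
-- known escape character as delimiter
c = LOAD '/data/tab_separated.txt' USING PigStorage('\t') AS (f1:chararray, f2:chararray);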