How to convert fields into bags and tuples in Pig?

I have a dataset which has comma-separated values:
10,4,21,9,50,9,4,50
50,78,47,7,4,7,4,50
68,25,43,13,11,68,10,9
I want to convert this into Bags and tuples as shown below:
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
({(50),(78),(47),(7),(4)},{(7),(4),(50)})
({(68),(25),(43),(13),(11)},{(68),(10),(9)})
I have tried the command below, but it does not show any data.
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (bag1:bag{t1:tuple(p1:int, p2:int, p3:int, p4:int, p5:int)}, bag2:bag{t2:tuple(p6:int, p7:int, p8:int)});
grunt> dump dataset;
Below is the output of dump:
2015-09-11 05:26:31,057 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 8 time(s).
2015-09-11 05:26:31,057 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-09-11 05:26:31,058 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-09-11 05:26:31,058 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-09-11 05:26:31,063 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-09-11 05:26:31,063 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(,)
(,)
(,)
(,)
Please help: how can I convert the dataset into bags and tuples?

Got the solution. The original load fails because PigStorage(',') splits each line into eight atomic fields, which cannot be converted to bags (hence the FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings). Instead, load the values as plain ints and build the two bags with TOBAG:
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (p1:int, p2:int, p3:int, p4:int, p5:int, p6:int, p7:int, p8:int);
grunt> dataset2 = Foreach dataset Generate TOBAG(p1, p2, p3, p4, p5) as bag1, TOBAG(p6, p7, p8) as bag2;
grunt> dump dataset2;
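With the sample rows above, dump dataset2 should now print each line as a tuple of two bags, matching the desired output:
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
({(50),(78),(47),(7),(4)},{(7),(4),(50)})
({(68),(25),(43),(13),(11)},{(68),(10),(9)})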

Related

Pig script warning: while trying to do any FOREACH I am getting these warnings

grunt> a = load '/user/horton/flightdelays_clean/part-m-00000' using PigStorage(',');
2016-10-12 15:22:25,593 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> b = group a by $0;
grunt> c = foreach b generate COUNT($0);
2016-10-12 15:22:40,244 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-10-12 15:22:40,248 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s).
You are passing a field of one type as an argument to a function that requires another type.
Try this:
grunt> b = group a by $0;
grunt> c = foreach b generate COUNT(a);
That is,
c = foreach b generate COUNT($0);
should be
c = foreach b generate COUNT(a);
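To see why this works: after a GROUP, every record has the shape (group, a), where $0 is the grouping key (a single atom, not a bag) and the second field a is the bag of grouped rows, which is what COUNT expects. A minimal sketch, assuming the same schema-less load as above:
grunt> a = load '/user/horton/flightdelays_clean/part-m-00000' using PigStorage(',');
grunt> b = group a by $0;
-- each record of b is (group, a); count the rows in each bag
grunt> c = foreach b generate group, COUNT(a);
grunt> dump c;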

Error while executing ForEach - Apache PIG

I have 3 logs: a Squid log, a login log, and a logoff log. I need to cross-reference these logs to find out which sites each user accessed.
I'm using Apache Pig and created the following script to do it:
copyFromLocal /home/marcelo/Documentos/hadoop/squid.txt /tmp/squid.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_in /tmp/login.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_out /tmp/logout.txt;
squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
nsquid = FOREACH squid GENERATE FLATTEN (STRSPLIT(linha,'[ ]+'));
nsquid = FOREACH nsquid GENERATE $0 AS timeStamp:chararray, $2 AS ipCliente:chararray, $5 AS request:chararray, $6 AS url:chararray;
nsquid = FOREACH nsquid GENERATE FLATTEN (STRSPLIT(timeStamp,'[.]'))AS (timeStamp:int,resto:chararray),ipCliente,request,url;
nsquid = FOREACH nsquid GENERATE (int)$0 AS timeStamp:int, $2 AS ipCliente:chararray,$3 AS request:chararray, $4 AS url:chararray;
connect = FILTER nsquid BY (request=='CONNECT');
login = LOAD '/tmp/login.txt' USING PigStorage(' ') AS (serverAL: chararray, data: chararray, hora: chararray, netlogon: chararray, on: chararray, ip: chararray);
nlogin = FOREACH login GENERATE FLATTEN(STRSPLIT(serverAL,'[\\\\]')),data, hora,FLATTEN(STRSPLIT(ip,'[\\\\]'));
nlogin = FOREACH nlogin GENERATE $1 AS al:chararray, $2 AS data:chararray, $3 AS hora:chararray, $4 AS ipCliente:chararray;
logout = LOAD '/tmp/logout.txt' USING PigStorage(' ') AS (data: chararray, hora: chararray, logout: chararray, ipAl: chararray, disconec: chararray);
nlogout = FOREACH logout GENERATE data, hora, FLATTEN(STRSPLIT(ipAl,'[\\\\]'));
nlogout = FOREACH nlogout GENERATE $0 AS data:chararray,$1 AS hora:chararray,$2 AS ipCliente:chararray, $3 AS al:chararray;
data = JOIN nlogin BY (al,ipCliente,data), nlogout BY (al,ipCliente,data);
ndata = FOREACH data GENERATE nlogin::al,ToUnixTime(ToDate(CONCAT(nlogin::data, nlogin::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogin:int,ToUnixTime(ToDate(CONCAT(nlogout::data, nlogout::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogout:int,nlogout::ipCliente;
BB = FOREACH ndata GENERATE $0 AS al:chararray, (int)$1 AS tslogin:int, (int)$2 AS tslogout:int, $3 AS ipCliente:chararray;
CC = JOIN BB BY ipCliente, connect BY ipCliente;
DD = FOREACH CC GENERATE BB::al AS al:chararray, (int)BB::tslogin AS tslogin:int, (int)BB::tslogout AS tslogout:int,(int)connect::timeStamp AS timeStamp:int, connect::ipCliente AS ipCliente:chararray, connect::url AS url:chararray;
EE = FILTER DD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
STORE EE INTO 'EEs';
But it returns the following error:
2015-10-16 21:58:10,600 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201510162141_0008 has failed! Stop running all dependent jobs
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Error while executing ForEach at [DD[93,5]]
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-10-16 21:58:10,667 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.1 0.12.1 root 2015-10-16 21:56:48 2015-10-16 21:58:10 HASH_JOIN,FILTER
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201510162141_0007 2 1 4 3 4 4 9 9 9 9 BB,data,login,logout,ndata,nlogin,nlogout HASH_JOIN
Failed Jobs:
JobId Alias Feature Message Outputs
job_201510162141_0008 CC,DD,EE,connect,nsquid,squid HASH_JOIN Message: Job failed! Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201510162141_0008_r_000000 hdfs://localhost:9000/user/root/EEb,
Input(s):
Successfully read 7367 records from: "/tmp/login.txt"
Successfully read 7374 records from: "/tmp/logout.txt"
Failed to read data from "/tmp/squid.txt"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/root/EEb"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201510162141_0007 -> job_201510162141_0008,
job_201510162141_0008
2015-10-16 21:58:10,674 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 11 time(s).
2015-10-16 21:58:10,674 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
I created an alternative that worked, replacing the penultimate line with:
STORE DD INTO 'DD';
newDD = LOAD 'hdfs://localhost:9000/user/root/DD' USING PigStorage AS (al:chararray, tslogin:int, tslogout:int, timeStamp:int, ipCliente:chararray, url:chararray);
EE = FILTER newDD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
Does anyone have any idea how to fix it without the STORE?

In Pig Latin, I am not able to load data as multiple tuples, please advise

I am not able to load the data as multiple tuples, and I am not sure what mistake I am making. Please advise.
data.txt
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
I want to load them as 2 tuples.
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:tuple(name:bytearray, no:int), T2:tuple(result:chararray, school:chararray));
OR
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:(name:bytearray, no:int), T2:(result:chararray, school:chararray));
dump A;
Only the empty tuples below are displayed, one per line; I don't know why I am not able to read the actual data from data.txt.
(,)
(,)
(,)
As the input data is not stored as tuples, we won't be able to read it directly into a tuple.
One feasible approach is to read the data and then form tuples from the required fields.
Pig Script :
A = LOAD 'a.csv' USING PigStorage('\t') AS (name:chararray,no:int,result:chararray,school:chararray);
B = FOREACH A GENERATE (name,no) AS T1:tuple(name:chararray, no:int), (result,school) AS T2:tuple(result:chararray, school:chararray);
DUMP B;
Input : a.csv
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
Output : DUMP B:
((vineet,1),(pass,Govt))
((hisham,2),(pass,Prvt))
((raj,3),(fail,Prvt))
Output : DESCRIBE B :
B: {T1: (name: chararray,no: int),T2: (result: chararray,school: chararray)}
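As a follow-up sketch (not part of the original answer), the nested tuple fields can then be addressed with the dot operator:
-- project individual fields out of the nested tuples
C = FOREACH B GENERATE T1.name, T2.school;
DUMP C;
which should yield:
(vineet,Govt)
(hisham,Prvt)
(raj,Prvt)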

Unable to read data from Bag in Pig Script

Can you please let me know how to read a '|'-delimited file whose fields include bag data? I am getting the error below.
Input data:
Jorge Posada Yankees|{(Catcher),(Designated_hitter)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell Oakland|{(Catcher),(First_baseman)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado Atlanta|{(Second_baseman),(Infielder),(Left_fielder)},[games#258,hit_by_pitch#3]
bfile= LOAD '/home/cloudera/basketball.txt' using PigStorage('|')as(name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
grunt> players = load 'basketball.txt' using PigStorage('|')as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
2014-11-13 04:49:48,144 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 27, column 117> mismatched input ';' expecting RIGHT_PAREN
Details at logfile: /home/cloudera/pig_1415835089181.log
For the above input, a regex is not required; you can access all the values using the existing schema itself. Note that the input below corrects the delimiters so that every field is separated by '|' (the original third row had a ',' before the map):
input.txt
Jorge Posada |Yankees|{(Catcher),(Designated_hitter)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell |Oakland|{(Catcher),(First_baseman)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado |Atlanta|{(Second_baseman),(Infielder),(Left_fielder)}|[games#258,hit_by_pitch#3]
Pig script:
bfile= LOAD 'input.txt' using PigStorage('|') as (name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
--Print the name and team
B = FOREACH bfile GENERATE name,team;
--DUMP B;
--Print the player and his position
C = FOREACH bfile GENERATE name,pos.(p);
--DUMP C;
--Print the player and key/value of games and hit_by_pitch
D = FOREACH bfile GENERATE name,bat#'games',bat#'hit_by_pitch';
--DUMP D;
Output of DUMP B:
(Jorge Posada ,Yankees)
(Landon Powell ,Oakland)
(Martin Prado ,Atlanta)
Output of DUMP C:
(Jorge Posada ,{(Catcher),(Designated_hitter)})
(Landon Powell ,{(Catcher),(First_baseman)})
(Martin Prado ,{(Second_baseman),(Infielder),(Left_fielder)})
Output of DUMP D (note the trailing empty field for Landon Powell, whose map has no hit_by_pitch key):
(Jorge Posada ,1594,65)
(Landon Powell ,26,)
(Martin Prado ,258,3)
In the bag, if you need multiple fields, then declare and access them like this:
pos:bag{t:(p:chararray,q:chararray)}
FOREACH bfile GENERATE name,pos.(p,q);
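Putting that together, a minimal sketch, assuming a hypothetical input2.txt whose bag entries carry two fields, e.g. {(Catcher,Right),(First_baseman,Left)}:
bfile2 = LOAD 'input2.txt' USING PigStorage('|') AS (name:chararray, team:chararray, pos:bag{t:(p:chararray, q:chararray)}, bat:map[]);
-- project both fields of every tuple in the bag
C2 = FOREACH bfile2 GENERATE name, pos.(p,q);
DUMP C2;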

Pig Latin - not showing the right record numbers

I have written a Pig script for wordcount which works fine. I can see the results from the Pig script in my output directory in HDFS. But towards the end of my console output, I see the following:
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1695568121_0002 1 1 0 0 0 0 0 0 0 0 words_sorted SAMPLER
job_local2103470491_0003 1 1 0 0 0 0 0 0 0 0 words_sorted ORDER_BY /output/result_pig,
job_local696057848_0001 1 1 0 0 0 0 0 0 0 0 book,words,words_agg,words_grouped GROUP_BY,COMBINER
Input(s):
Successfully read 0 records from: "/data/pg5000.txt"
Output(s):
Successfully stored 0 records in: "/output/result_pig"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local696057848_0001 -> job_local1695568121_0002,
job_local1695568121_0002 -> job_local2103470491_0003,
job_local2103470491_0003
2014-07-01 14:10:35,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
As you can see, the job succeeded, but the Input(s) and Output(s) did not: both say 0 records were read/stored, and the counter values are all 0.
Why are the values zero? They should not be zero.
I am using Hadoop 2.2 and Pig 0.12.
Here is the script:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;
words_grouped = group words by word;
words_agg = foreach words_grouped generate group as word, COUNT(words);
words_sorted = ORDER words_agg BY $1 DESC;
STORE words_sorted into '/output/result_pig' using PigStorage(':','-schema');
NOTE: my data is present in /data/pg5000.txt and not in the default directory, which is /usr/name/data/pg5000.txt
EDIT: here is the output of printing my file to the console:
hadoop fs -cat /data/pg5000.txt | head -10
The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
cat: Unable to write to output stream.
Please correct the following line
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
to
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
I am assuming the delimiter is a comma here; use whichever character separates the records in your file. This will solve the issue.
Also note --
If no argument is provided, PigStorage will assume tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (eg: 'a', '|'), known escape character (eg: '\t', '\r') is a valid delimiter.
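A short sketch of both forms, with hypothetical file names:
-- no argument: PigStorage splits on tabs
a = LOAD 'data.tsv' USING PigStorage() AS (f1:chararray, f2:int);
-- explicit single-byte delimiter
b = LOAD 'data.csv' USING PigStorage(',') AS (f1:chararray, f2:int);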