Referencing a field in a nested tuple in Pig

I have been stuck on this for several hours and I cannot figure out what I am doing wrong.
I have a relation "grouped" with the schema of
grouped: {seedword: chararray,baggy: {outertup: (groupy: (seedword: chararray,coword: chararray))}}
A sample of what the relation looks like is:
(auto,{((auto,car)),((auto,truck))})
I need to generate just the seedword and a tuple of cowords. In my example I would want
(auto, (car, truck)).
I have tried:
FOREACH grouped GENERATE baggy::outertup.groupy.coword;
FOREACH grouped GENERATE baggy.outertup.groupy.coword;
FOREACH grouped GENERATE baggy.groupy.coword;
but none of these work; each gives an error message saying there is no such field. Please help!
Here's some more of my code:
keywords = LOAD 'merged' USING as ( seedword:chararray, doc:chararray);
---COUNT HOW MANY DOCUMENTS EACH WORD IS IN
group_by_seedword = GROUP keywords BY $0;
invert_index = FOREACH group_by_seedword GENERATE $0 as seedword:chararray, keywords.$1;
word_doc_count= FOREACH invert_index GENERATE seedword, COUNT($1);
-- map words to document
words_in_doc= GROUP keywords BY doc;
word_docs = FOREACH words_in_doc GENERATE group AS doc, keywords.seedword;
--(document:(keyword, keyword, keyword...))
--map words to their cowords in doc
temp_join = JOIN keywords BY doc,word_docs BY doc;
--DUMP temp_join;
cowords_by_doc = FOREACH temp_join GENERATE $0 as seedword:chararray, $3 as cowords;
cowords_interm= FOREACH cowords_by_doc GENERATE seedword, FLATTEN(cowords);
cowords = FILTER cowords_interm BY (seedword!=$1);---GETS RID OF SINGLE DOC WORD;
temp_join_count1 = JOIN cowords BY $0, word_doc_count BY seedword;
-- GETS WORDS THAT OCCUR BY THEMSELVES IN A SINGLE DOCUMENT
G = JOIN cowords_interm BY $0 LEFT OUTER, cowords by $0;
orph_word = FILTER G BY $2 is null;
orph_word_count = FOREACH orph_word GENERATE $0,null, 0;
temp_join_count= UNION temp_join_count1, orph_word_count;
inter_frac = FOREACH temp_join_count GENERATE $0 as seedword:chararray, $1 as coword:chararray, 1.0/$3 as frac:double;
inter_frac_combine = GROUP inter_frac BY (seedword, coword);
inter_frac_sum = FOREACH inter_frac_combine GENERATE $0 , SUM(inter_frac.frac) as frac:double;
filtered = FILTER inter_frac_sum BY ($1 >=$relatedness_ratio);
grouped= GROUP filtered by $0.seedword;
g = FOREACH grouped GENERATE group as seedword:chararray, filtered.$0;
named = FOREACH g GENERATE $0 as seedword:chararray, $1 as baggy:bag{(outertup:tuple(groupy:tuple(seedword:chararray, coword:chararray)))};
A sample input file you can try looks like this:
car doc1.txt
auto doc1.txt
bunny doc2.txt
ball doc2.txt
toy car doc2.txt
random doc3.txt
plane doc3.txt

I'd had a similar issue where I couldn't reference inner tuples.
My solution was to flatten the data and then some more filtering and grouping.
Cheers
V
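A minimal sketch of that flatten-then-regroup idea, assuming the `named` relation and schema from the question. The aliases `flat`, `pairs`, `regrouped` and `result` are made up for illustration, and this has not been run against the original data:

```pig
-- Assumes `named` has the schema shown in the question:
-- named: {seedword: chararray, baggy: {outertup: (groupy: (seedword: chararray, coword: chararray))}}

-- Unnest the bag: one row per inner tuple, e.g. (auto,(auto,car))
flat = FOREACH named GENERATE seedword, FLATTEN(baggy);

-- Pull the coword out of the nested tuple by position
pairs = FOREACH flat GENERATE $0 AS seedword, $1.$1 AS coword;

-- Regroup so that each seedword carries a bag of its cowords
regrouped = GROUP pairs BY seedword;
result = FOREACH regrouped GENERATE group AS seedword, pairs.coword AS cowords;
```

This should give a bag of cowords per seedword rather than a tuple. If a tuple is really required, newer Pig releases ship a BagToTuple builtin that could be wrapped around pairs.coword in the last statement.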

Related

How to join and filter 2 relations by columns in Pig?

I'm a novice to Pig script, and trying to modify some existing pig script to extract some data from log files.
E.g. I have 2 log files, one with the schema as:
message Class {
message Student {
optional int32 uid = 1;
optional string name = 2;
}
optional int32 cid = 1;
repeated Student students = 2;
}
After loading, I think a bag (say, bag1) is created (correct me if I'm wrong):
bag1:
{
(uid1, {(cid11, name11), (cid12, name12), (cid13, name13), ...}),
(uid2, {(cid21, name21), (cid22, name22), (cid23, name23), ...}),
...
}
And another log file is simple, the resulting bag (bag2) is like this.
bag2:
{
(name11),
(name13),
(name22),
...
}
What I want is, get all the rows from bag1 if any name in bag2 is contained inside the row, like:
result bag:
{
(uid1, (name11, name13)),
(uid2, (name22)),
}
I think I'll need to do some join/filter on these 2 bags, but don't know how.
I tried a script snippet like the one below, but it's not even a valid script.
res = FOREACH bag1 {
names = FOREACH students GENERATE name;
xnames = JOIN names by name, bag2 by name;
GENERATE cid, xnames;
};
FILTER res BY not IsEmpty(xnames);
Could anyone please give me some help with the script?
You won't be able to use JOIN inside a nested FOREACH. Instead, you can flatten the nested bag and then join the result with the second relation:
bag1_flat = FOREACH bag1 GENERATE $0 AS uid, FLATTEN($1);
bag1_flat = FOREACH bag1_flat GENERATE uid, $2 AS name;
An inner join will then filter the rows:
bag12 = JOIN bag1_flat by name, bag2 by $0;
bag12 = FOREACH bag12 GENERATE bag1_flat::uid AS uid, bag1_flat::name AS name;
Finally, group by uid. You won't get tuples, though, since the number of names varies per uid; you'll get bags:
bag12_group = GROUP bag12 BY uid;
res = FOREACH bag12_group GENERATE group AS uid, bag12.name AS names;

PIG - STORING TEMPORARY VALUES

Data schema: sdesc:chararray, samt:int, syear:chararray, stype:chararray
Data:
Wrench 259000 2000 store
Wrench 135000 2000 online
Wrench 175000 2001 online
Wrench 180000 2001 store
Script
ysales = LOAD 'salesdata.txt' USING PigStorage() AS (sdesc:chararray, samt:int, syear:chararray, stype:chararray);
basedata = FILTER ysales BY (sdesc == 'Wrench') AND (syear == '2000') AND (stype == 'store');
My result set from DUMP basedata; is:
(Wrench,259000,2000,store)
So the question is: how do I break up basedata to get, for example, A = 'Wrench', B = 259000, C = 2000, D = 'store'?
You can use positional references ($0, $1, ...) to extract the value of each column:
a = foreach basedata generate $0;
b = foreach basedata generate $1;
c = foreach basedata generate $2;
d = foreach basedata generate $3;
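Pig has no standalone variables, so the values stay inside relations. As a rough sketch (the aliases `parts` and `scaled` and the `ratio` column are hypothetical, and this assumes the `ysales` and `basedata` relations from the question), the four columns can be named in a single pass, and a one-row relation can be referenced as a scalar from another statement:

```pig
-- Name all four columns in one pass instead of building four relations
parts = FOREACH basedata GENERATE $0 AS a, $1 AS b, $2 AS c, $3 AS d;

-- Because basedata holds exactly one row, its fields can be used as scalars
-- elsewhere (Pig fails at runtime if the relation has more than one row).
scaled = FOREACH ysales GENERATE sdesc, samt, (double)samt / basedata.samt AS ratio;
```

The scalar form is only safe when the filter is known to return a single row, as it does for the sample data here.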
data = load '/home/satish/wrench' using PigStorage(' ') as (name,total,year,type) ;
-- optionally apply a FILTER here first
reqdata = foreach data generate CONCAT('A','=',name) as A, CONCAT('B','=',total) as B, CONCAT('C','=',year) as C,CONCAT('D','=',type) as D;
dump reqdata;
(A=Wrench,B=259000,C=2000,D=store)
(A=Wrench,B=135000,C=2000,D=online)
(A=Wrench,B=175000,C=2001,D=online)
(A=Wrench,B=180000,C=2001,D=store)
fdata = foreach reqdata generate A,B;
dump fdata;
(A=Wrench,B=259000)
(A=Wrench,B=135000)
(A=Wrench,B=175000)
(A=Wrench,B=180000)
If you want to unnest tuples, use FLATTEN.

Identifying columns through PiG

I have data set like below :
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
What separator should I use in this case to split out the three columns above?
First column value is => Column,1A
Second column value is => Column2A
Third column value is => Column3A
Try this code:
a = LOAD '/home/hduser/pig_ex' USING PigStorage(',') AS (col1,col2,col3,col4);
b = FOREACH a GENERATE REGEX_EXTRACT(col1, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(col2, '^(.*)\\"', 1) AS (modsecondcol),col3,col4;
c = foreach b generate CONCAT($0, CONCAT(', ', $1)), $2 , $3;
dump c;
I was able to resolve it using the steps below:
Input:-
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
PiG Script :-
A = load '/home/hduser/pig_ex' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\,',4)) AS (firstcol:chararray,secondcol:chararray,thirdcol:chararray,forthcol:chararray);
C = FOREACH B GENERATE REGEX_EXTRACT(firstcol, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(secondcol, '^(.*)\\"', 1) AS (modsecondcol),thirdcol,forthcol;
D = FOREACH C GENERATE CONCAT(modfirstcol,',',modsecondcol),thirdcol,forthcol;
DUMP D;
Output :-
(column,1A,column2A,column3A)
(column,1B,column2B,column3B)
(column,1C,column2C,column3C)
(column,1D,column2D,column3D)
Please let me know if there is any better way
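One alternative that may be simpler, assuming the piggybank jar is available in your installation, is a loader that understands quoted CSV fields, so that no splitting and re-joining is needed. The jar path and the alias below are assumptions, and this is a sketch rather than a tested script:

```pig
-- Path to piggybank.jar is an assumption; adjust it to your installation
REGISTER piggybank.jar;

-- CSVExcelStorage treats a double-quoted field as one value,
-- so "column,1A" should stay in a single column
A = LOAD '/home/hduser/pig_ex'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (firstcol:chararray, secondcol:chararray, thirdcol:chararray);
DUMP A;
```

This keeps the schema at three columns, matching the data, instead of four columns that have to be stitched back together.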

PIG: FLATTEN error

I have a relation X with structure X: {group: chararray,inboundCount: {(name: chararray,inb: long)},outboundCount: {(name: chararray,out: long)}} as follows:
(IAD,{},{(IAD,25)})
(LAX,{},{(LAX,2)})
(ORD,{(ORD,27)},{})
(PDX,{},{(PDX,3)})
(SFO,{(SFO,3)},{})
I want an output with the following structure final: {airport: chararray,inbound: long,outbound: long} with the output:
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
I've tried the following code, and it gives the output structure that I want, but nothing gets printed. Is it because of the null value bags?
final = foreach X generate group as airport,FLATTEN(inboundCount.inb) as inbound,FLATTEN(outboundCount.out) as outbound;
Please help me.
EDIT
I got this relation X by executing the following commands.
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
Sample input record:
2008,1,31,4,1757,1155,2400,1758,UA,114,N845UA,243,243,217,362,362,LAX,ORD,1745,11,15,0,,0,0,0,362,0,0
You are almost there. FLATTEN on an empty bag produces no output rows, which is why nothing is printed. Just apply SUM instead of FLATTEN:
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
final_data = FOREACH X GENERATE group as airport, SUM(inboundCount.inb) as inb, SUM(outboundCount.out) as out;
dump final_data;
The dump of final_data will give you the expected result.
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
If you want, you can also replace the NULL counts with 0:
final_null_check = FOREACH final_data GENERATE airport,(inb is null ? 0 :inb) as inb_cnt, (out is null ? 0 : out) as out_cnt;
After the NULL check, dumping the final_null_check relation gives output like this:
(IAD,0,25)
(LAX,0,2)
(ORD,27,0)
(PDX,0,3)
(SFO,3,0)

Using Aggregate functions in Pig

My input file is below
a1,1,on,400
a1,2,off,100
a1,3,on,200
I need to add $3 only if $2 is equal to "on". I have written the script below, but I don't know how to proceed from there. Summing $3 needs a filter applied; summing $1 needs no filter at all.
Can someone help me finish this?
myinput = LOAD 'file' USING PigStorage(',') AS (id:chararray, num:int, flag:chararray, amt:int);
grouped = GROUP myinput BY id;
I need output as below
a1, 6,600
Here is a possible solution,
You could do something like this (not tested) :
myinput = LOAD 'file' USING PigStorage(',') AS (id:chararray, num:int, flag:chararray, amt:int);
A = FOREACH myinput GENERATE $0 as id, $1 as first_sum, (($2 == 'on') ? $3 : 0) as second_sum;
grouped = GROUP A BY id;
RESULT = FOREACH grouped GENERATE group as id, SUM($1.first_sum), SUM($1.second_sum);
That should do the trick
Try this
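myinput = LOAD '/home/gopalkrishna/PIGPRAC/pig-sum.txt' using PigStorage(',') as (name:chararray,num:int,stat:chararray,amt:int);
-- keep amt only for rows where stat is 'on', so the second sum matches the requested 600
A = FOREACH myinput GENERATE name, num, ((stat == 'on') ? amt : 0) AS on_amt;
G = GROUP A BY name;
B = FOREACH G GENERATE group, SUM(A.num), SUM(A.on_amt);
STORE B INTO 'SUMOUT';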