Identifying columns through Pig - apache-pig

I have a data set like the one below:
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
What separator should I use to split this into the following 3 columns?
First column value is => column,1A
Second column value is => column2A
Third column value is => column3A

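Try this code:
a = LOAD '/home/hduser/pig_ex' USING PigStorage(',') AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray);
-- Strip the opening quote from the first piece and the closing quote from the second.
b = FOREACH a GENERATE REGEX_EXTRACT(col1, '^\\"(.*)', 1) AS modfirstcol, REGEX_EXTRACT(col2, '^(.*)\\"', 1) AS modsecondcol, col3, col4;
-- Rejoin the two pieces with the comma that the split removed.
c = FOREACH b GENERATE CONCAT($0, CONCAT(',', $1)), $2, $3;
DUMP c;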

I was able to resolve it using the steps below:
Input:
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
Pig script:
A = LOAD '/home/hduser/pig_ex' AS line;
-- Split each line on commas; the quoted first field is split into two pieces.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\,',4)) AS (firstcol:chararray,secondcol:chararray,thirdcol:chararray,fourthcol:chararray);
C = FOREACH B GENERATE REGEX_EXTRACT(firstcol, '^\\"(.*)', 1) AS modfirstcol, REGEX_EXTRACT(secondcol, '^(.*)\\"', 1) AS modsecondcol, thirdcol, fourthcol;
D = FOREACH C GENERATE CONCAT(modfirstcol,',',modsecondcol), thirdcol, fourthcol;
DUMP D;
Output:
(column,1A,column2A,column3A)
(column,1B,column2B,column3B)
(column,1C,column2C,column3C)
(column,1D,column2D,column3D)
Please let me know if there is a better way.
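One alternative that avoids the split-and-rejoin step is a loader that understands quoted fields. The sketch below assumes the piggybank jar is available and uses its CSVExcelStorage loader; the jar path is a placeholder, not something from the original post, and the field names are illustrative.

```pig
-- Hypothetical path: point this at wherever piggybank.jar lives on your system.
REGISTER '/path/to/piggybank.jar';

-- CSVExcelStorage treats a double-quoted field as a single value,
-- so the comma inside "column,1A" is not used as a delimiter.
A = LOAD '/home/hduser/pig_ex'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray);

DUMP A;
```

With this approach each row should load as three fields directly, so no regex cleanup is needed afterwards.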

Related

PIG - STORING TEMPORARY VALUES

Data schema: sdesc:chararray, samt:int, syear:chararray, stype:chararray
Data:
Wrench 259000 2000 store
Wrench 135000 2000 online
Wrench 175000 2001 online
Wrench 180000 2001 store
Script
ysales = LOAD 'salesdata.txt' USING PigStorage() AS (sdesc:chararray, samt:int, syear:chararray, stype:chararray);
basedata = FILTER ysales BY (sdesc == 'Wrench') AND (syear == '2000') AND (stype == 'store');
My result set is:
DUMP basedata;
(Wrench,259000,2000,store)
So the question is: how do I break up basedata to have (for example) A = 'Wrench', B = 259000, C = 2000, D = 'store'?
You can use positional references ($0, $1, ...) to extract values by column:
a = foreach basedata generate $0;
b = foreach basedata generate $1;
c = foreach basedata generate $2;
d = foreach basedata generate $3;
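The four projections above can also be written as a single statement that names every position at once. This is a minimal sketch that assumes basedata is the filtered relation from the question; the relation names parts and only_b are illustrative.

```pig
-- basedata is the relation defined in the question.
-- Name each positional field so later statements can refer to A, B, C and D.
parts = FOREACH basedata GENERATE $0 AS A, $1 AS B, $2 AS C, $3 AS D;

-- Any single value can then be projected by name.
only_b = FOREACH parts GENERATE B;
DUMP only_b;
```

Note that these are still relations rather than variables: each one holds a column of values, one per input row.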
data = load '/home/satish/wrench' using PigStorage(' ') as (name,total,year,type) ;
-- Apply a FILTER here if you want to restrict the rows.
reqdata = foreach data generate CONCAT('A','=',name) as A, CONCAT('B','=',total) as B, CONCAT('C','=',year) as C,CONCAT('D','=',type) as D;
dump reqdata;
(A=Wrench,B=259000,C=2000,D=store)
(A=Wrench,B=135000,C=2000,D=online)
(A=Wrench,B=175000,C=2001,D=online)
(A=Wrench,B=180000,C=2001,D=store)
fdata = foreach reqdata generate A,B;
dump fdata;
(A=Wrench,B=259000)
(A=Wrench,B=135000)
(A=Wrench,B=175000)
(A=Wrench,B=180000)
If you want to remove the tuple nesting, use FLATTEN.

PIG: FLATTEN error

I have a relation X with structure X: {group: chararray,inboundCount: {(name: chararray,inb: long)},outboundCount: {(name: chararray,out: long)}} as follows:
(IAD,{},{(IAD,25)})
(LAX,{},{(LAX,2)})
(ORD,{(ORD,27)},{})
(PDX,{},{(PDX,3)})
(SFO,{(SFO,3)},{})
I want an output with the following structure final: {airport: chararray,inbound: long,outbound: long} with output:
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
I've tried the following code and it gives the output structure that I want, but nothing gets printed. Is it because of the empty bags?
final = foreach X generate group as airport,FLATTEN(inboundCount.inb) as inbound,FLATTEN(outboundCount.out) as outbound;
Please help me.
EDIT
I got this relation X by executing the following commands.
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
Sample input record:
2008,1,31,4,1757,1155,2400,1758,UA,114,N845UA,243,243,217,362,362,LAX,ORD,1745,11,15,0,,0,0,0,362,0,0
You are almost there. Please try this: just apply SUM instead of FLATTEN.
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
final_data = FOREACH X GENERATE group as airport, SUM(inboundCount.inb) as inb, SUM(outboundCount.out) as out;
dump final_data;
The dump of final_data will give you the expected result.
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
If you want, you can also replace the NULL counts with 0:
final_null_check = FOREACH final_data GENERATE airport, (inb is null ? 0 : inb) as inb_cnt, (out is null ? 0 : out) as out_cnt;
After the NULL check, if you dump the final_null_check relation you will get output like below:
(IAD,0,25)
(LAX,0,2)
(ORD,27,0)
(PDX,0,3)
(SFO,3,0)
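As for why the original FLATTEN version printed nothing: FLATTEN on an empty bag produces no output row, and every row of X has at least one empty bag. Another way to sidestep the problem is to skip COGROUP and use a full outer join, which leaves NULL on the side that has no match. This is a sketch that assumes inboundCount and outboundCount have the fields (name, inb) and (name, out), as the schema of X suggests; the names joined and final_join are illustrative.

```pig
-- Assumes inboundCount: (name, inb) and outboundCount: (name, out).
joined = JOIN inboundCount BY name FULL OUTER, outboundCount BY name;

-- Take the airport name from whichever side is present;
-- the missing side's count stays NULL.
final_join = FOREACH joined GENERATE
    (inboundCount::name IS NULL ? outboundCount::name : inboundCount::name) AS airport,
    inboundCount::inb AS inbound,
    outboundCount::out AS outbound;

DUMP final_join;
```

The same NULL-to-0 replacement shown above can be applied to final_join if zeros are preferred.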

Regex to extract first part of string in Apache Pig

I need to extract the postcode district from the input data below:
AB55 4
DD7 6LL
DD5 2HI
My code:
A = load 'data' as postcode:chararray;
B = foreach A {
code_district = REGEX_EXTRACT(postcode,'<SOME EXP>',1);
generate code_district;
};
dump B;
Output should look like
AB55
DD7
DD5
What should the regular expression be to extract the first part of the string?
Can you try one of the regexes below?
Option 1:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'(\\w+).*',1);
DUMP code_district;
Option 2:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'([a-zA-Z0-9]+).*',1);
DUMP code_district;
Output:
(AB55)
(DD7)
(DD5)
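Both options work because the greedy word-character group stops at the first space. If the district could ever contain a character outside \w, a variant that captures everything up to the first whitespace may be more robust. This is a sketch only; the alias district is illustrative.

```pig
A = LOAD 'input' AS (postcode:chararray);

-- ^ anchors at the start of the string, (\\S+) captures the run of
-- non-whitespace characters, and .* consumes the rest of the line.
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode, '^(\\S+).*', 1) AS district;

DUMP code_district;
```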

Using Aggregate functions in Pig

My input file is below
a1,1,on,400
a1,2,off,100
a1,3,on,200
I need to sum $3 only where $2 is equal to "on". I have written the script below, but I don't know how to proceed. Summing $3 needs a filter; summing $1 needs no filter at all.
Can someone help me finish this?
myinput = LOAD 'file' USING PigStorage(',') AS (id:chararray, num:int, flag:chararray, amt:int);
grouped = GROUP myinput BY id;
I need output as below:
a1,6,600
Here is a possible solution. You could do something like this (not tested):
myinput = LOAD 'file' USING PigStorage(',');
A = FOREACH myinput GENERATE $0 as id, $1 as first_sum, (($2 == 'on') ? $3 : 0) as second_sum;
grouped = GROUP A BY id;
RESULT = FOREACH grouped GENERATE group as id, SUM($1.first_sum), SUM($1.second_sum);
That should do the trick.
Try this (note that it sums amt over every row, not only the rows where the flag is "on"):
myinput = LOAD '/home/gopalkrishna/PIGPRAC/pig-sum.txt' using PigStorage(',') as (name:chararray,num:int,stat:chararray,amt:int);
A = GROUP myinput BY name;
B = FOREACH A GENERATE group, SUM(myinput.num),SUM(myinput.amt);
STORE B INTO 'SUMOUT';
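The conditional sum can also be expressed with a nested FOREACH, filtering inside each group before summing. This is a minimal sketch that assumes the four-column layout of the sample input; the field names num and flag and the relation names on_rows and summary are illustrative.

```pig
myinput = LOAD 'file' USING PigStorage(',')
          AS (id:chararray, num:int, flag:chararray, amt:int);

grouped = GROUP myinput BY id;

summary = FOREACH grouped {
    -- Keep only the rows whose flag is 'on' for the second sum.
    on_rows = FILTER myinput BY flag == 'on';
    GENERATE group AS id,
             SUM(myinput.num) AS num_sum,    -- unfiltered sum of column 2
             SUM(on_rows.amt) AS on_amt_sum; -- sum of column 4 for 'on' rows only
};

DUMP summary;
```

For the three sample rows this should give 1+2+3 = 6 for the first sum and 400+200 = 600 for the second.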

Referencing field in nested tuple in PIG

I have been stuck on this for several hours and I cannot figure out what I am doing wrong.
I have a relation "grouped" with the schema of
grouped: {seedword: chararray,baggy: {outertup: (groupy: (seedword: chararray,coword: chararray))}}
A sample of what the relation looks like is:
(auto,{((auto,car)),((auto,truck))})
I need to generate just the seedword and a tuple of cowords. In my example I would want
(auto, (car, truck)).
I have tried:
FOREACH grouped GENERATE baggy::outertup.groupy.coword;
FOREACH grouped GENERATE baggy.outertup.groupy.coword;
FOREACH grouped GENERATE baggy.groupy.coword;
but none of these work; each gives an error message saying there is no such field. Please help!
Here's some more of my code:
keywords = LOAD 'merged' AS (seedword:chararray, doc:chararray); -- the loader after USING was missing in the original; the default loader is assumed here
---COUNT HOW MANY DOCUMENTS EACH WORD IS IN
group_by_seedword = GROUP keywords BY $0;
invert_index = FOREACH group_by_seedword GENERATE $0 as seedword:chararray, keywords.$1;
word_doc_count= FOREACH invert_index GENERATE seedword, COUNT($1);
-- map words to document
words_in_doc= GROUP keywords BY doc;
word_docs = FOREACH words_in_doc GENERATE group AS doc, keywords.seedword;
--(document:(keyword, keyword, keyword...))
--map words to their cowords in doc
temp_join = JOIN keywords BY doc,word_docs BY doc;
--DUMP temp_join;
cowords_by_doc = FOREACH temp_join GENERATE $0 as seedword:chararray, $3 as cowords;
cowords_interm= FOREACH cowords_by_doc GENERATE seedword, FLATTEN(cowords);
cowords = FILTER cowords_interm BY (seedword!=$1);---GETS RID OF SINGLE DOC WORD;
temp_join_count1 = JOIN cowords BY $0, word_doc_count BY seedword;
-- GETS WORDS THAT OCCUR BY THEMSELVES IN A SINGLE DOCUMENT
G = JOIN cowords_interm BY $0 LEFT OUTER, cowords by $0;
orph_word = FILTER G BY $2 is null;
orph_word_count = FOREACH orph_word GENERATE $0,null, 0;
temp_join_count= UNION temp_join_count1, orph_word_count;
inter_frac = FOREACH temp_join_count GENERATE $0 as seedword:chararray, $1 as coword:chararray, 1.0/$3 as frac:double;
inter_frac_combine = GROUP inter_frac BY (seedword, coword);
inter_frac_sum = FOREACH inter_frac_combine GENERATE $0 , SUM(inter_frac.frac) as frac:double;
filtered = FILTER inter_frac_sum BY ($1 >=$relatedness_ratio);
grouped= GROUP filtered by $0.seedword;
g = FOREACH grouped GENERATE group as seedword:chararray, filtered.$0;
named = FOREACH g GENERATE $0 as seedword:chararray, $1 as baggy:bag{(outertup:tuple(groupy:tuple(seedword:chararray, coword:chararray)))};
An input file you can try looks like this:
car doc1.txt
auto doc1.txt
bunny doc2.txt
ball doc2.txt
toy car doc2.txt
random doc3.txt
plane doc3.txt
I had a similar issue where I couldn't reference inner tuples.
My solution was to flatten the data and then do some more filtering and grouping.
Cheers
V
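The flatten-then-regroup idea from the answer might look like the sketch below. It assumes the relation filtered from the question's script, whose first field is the (seedword, coword) group tuple, and it assumes a Pig version that ships the BagToTuple built-in. The names pairs, by_seed and cowords_per_seed are illustrative.

```pig
-- 'filtered' comes from the question's script; $0 is the (seedword, coword) tuple.
-- Flattening the tuple turns its two members into top-level fields.
pairs = FOREACH filtered GENERATE FLATTEN($0) AS (seedword:chararray, coword:chararray);

-- Regroup by seedword so each seedword appears once.
by_seed = GROUP pairs BY seedword;

-- pairs.coword is a bag of cowords; BagToTuple converts it into a single tuple.
cowords_per_seed = FOREACH by_seed GENERATE group AS seedword,
                                            BagToTuple(pairs.coword) AS cowords;

DUMP cowords_per_seed;
```

Working from the flat (seedword, coword) pairs avoids having to dereference the doubly nested tuple at all.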