Pig one row to multiple row - apache-pig

Can you please provide Pig script for below query?
here's input format.
Input
ID, Label
122,a|b
215,q|b|c
214,Z|b|c
218,w|b|c
211,r|b|c
219,u|b
Expected output
122,a
122,b
215,q
215,b
215,c
214,Z
214,b
214,c
218,w
218,b
218,c
...........
Thanks,
Abhi

TOKENIZE the Label, it will give a bag and than FLATTEN it, which will give you as many rows as are tuples in the bag. Sample code
inpt = LOAD '....' USING PigStorage(',') AS (ID: chararray, Label : chararray);
result = FOREACH inpt GENERATE ID, FLATTEN(TOKENIZE(Lable, '|'));
DUMP result;

Related

Pig script to concatenate the values in the tuples

Input:
(11111111,{(A,MARK,APPLE,ABC1,11111111),(B,PAUL,AMAZON,ABC2,11111111),(C,TIM,FIVN,ABC3,11111111),(D,LIN,MULESFT,ABC4,11111111),(E,YEP,UHG,ABC5,11111111),(F,QIN,ATT,ABC6,11111111)})
(22222222,{(A,MARK,APPLE,ABC6,22222222),(B,MARK,AMAZON,ABC7,22222222),(C,MARK,PQE,ABC8,22222222),(D,MARK,AMB,ABC9,22222222),(E,MARK,YZQ,ABC19,22222222),(F,MARK,PQR,,22222222)})
I have grouped the data with the key as above. I should generate the output by concatenating all the values of the tuple including nulls as below:
Output:
(1111111,A^B^C^D^E^F,MARK^PAUL^TIM^LIN^YEP^QIN,APPLE^AMAZON^FIVN^MULESFT^UHG^ATT,ABC1^ABC2^ABC3^ABC4^ABC5^^ABC6)
(2222222,A^B^^D^E^G,TIM^AIN^TIM^BIN^CIN^DIN^RIN,APPLE^AMAZON^PQE^AMB^YZQ^RIN,ABC6^ABC7^ABC8^ABC9^ABC19^^)
Can some one please help me?
Sharing a code snippet which may help, work on this to achieve the expected output.
Input :
1,A
1,B
1,C
2,D
2,E
2,F
Output :
(1,C^B^A)
(2,F^E^D)
Pig Snippet :
data1 = load '/Users/muralirao/learning/pig/a.csv' using PigStorage(',') as (id:int, name:chararray);
req_data = FOREACH (GROUP data1 BY id) {
names = data1.name;
GENERATE group AS id, BagToString(names,'^');
};
DUMP req_data;

Convert the value of a column to uppercase in pig

I need to convert the value of a column to uppercase in pig.
Was able to do using UPPER but this creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
e,f,g,F
But I need
a,B,c
e,F,g
Please let me know if there is a way to so.
I didn't tried from my side but you can try like this
B = Foreach A generate column1,UPPER(column2),column3;
Using the "*" in the below line is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead use the below:
B = Foreach A generate column1, UPPER(column2), column3;
You can do it with user define function that default provided by Apache pig
find PiggyBank Jar
command
find / -name "piggybank*.jar*"
now goto pig grunt shell
code
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)

Pig : Adding new column to existing inner Tuple in Pig

I want to add new column to existing tuple column in Pig.
Example:Input Schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray)}
Output Schema:
Using generate statement I want to add new column in tuple which will hold the same value as length but with some other name.
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray, len : int)}
I tried below approach but its not working:
op = Foreach input_data generate
name,
attribute_list as attr : {(
height,
length,
size,
length as len)};
Please suggest.
Thanks in advance
Option 1:
Add a rank to each row, flatten attribute_list bag then recreate bag with additional columns.
--Rank input_schema(ip) using rank function:
ranked= rank ip;
-- flatten each value of bag.tuple to row level
a= foreach ranked generate rank_ip as id, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
b= group a by id;
op= foreach b generate flatten($1.name) as name, $1 as attr;
--The name also will be part of attr bag.
Option 2:
a. The DataFu has a pig udf to concat multiple tuple around bag.
b. Create UDF BagConcat.
define BagConcat datafu.pig.bags.BagConcat();
c. Flatten elements:
a= foreach ip generate name, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
d. Reproject your bag:
op= foreach a generate name, BagConcat(height,len,size,len) as attr;
A = LOAD 'PATH' USING PigStorage() AS (ID:INT);
B = FOREACH sourcenew GENERATE *, null as len:int;
You can also give any integer value in place of null.

How to make multiple rows from single row in pig for movie lens dataset

I want to divide a single row into multiple row on the basis of a field in pig.
Example:
Consider one of the row in movie Data Set as follows:
(31807, Dot the I (2003), Drama|Film-Noir|Thriller)
each field is separated by ','.
Desired Output is as follows in 3 different rows:
31807,Dot the I (2003),Drama
31807,Dot the I (2003),Film-Noir
31807,Dot the I (2003),Thriller
Can anyone please help me to get the desired output in pig.
The below logic will help you .
/* Input
(31807,Dot the I (2003),Drama|Film-Noir|Thriller)
*/
list = LOAD '/user/cloudera/movies.txt' USING PigStorage(',') AS(id:int,name:chararray,generes:chararray);
list_each = FOREACH list GENERATE id,name, flatten(TOKENIZE(generes,'|'));
dump list_each;
/* Output
(31807,Dot the I (2003),Drama)
(31807,Dot the I (2003),Film-Noir)
(31807,Dot the I (2003),Thriller)
*/

PIG script for creating IDxCITY matrix from given csv file

I have an input file includes ID,CITY and Count information as below and I want to create a csv file which includes ID and count numbers for each CITY. Count will be written as '0' if ID doesnt watched with the CITY. I tried to generate a pig script using group by, cogroup and flatten but couldnt make it to give this sample output.
How can I write a pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use below script for creating matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');