Pig one row to multiple row


Can you please provide a Pig script for the query below?
Here is the input format.
Input
ID, Label
122,a|b
215,q|b|c
214,Z|b|c
218,w|b|c
211,r|b|c
219,u|b
Expected output
122,a
122,b
215,q
215,b
215,c
214,Z
214,b
214,c
218,w
218,b
218,c
...........
Thanks,
Abhi

TOKENIZE the Label, which gives you a bag, and then FLATTEN it, which gives you as many rows as there are tuples in the bag. Sample code:
inpt = LOAD '....' USING PigStorage(',') AS (ID: chararray, Label: chararray);
result = FOREACH inpt GENERATE ID, FLATTEN(TOKENIZE(Label, '|'));
DUMP result;

Related

Pig script to concatenate the values in the tuples

Input:
(11111111,{(A,MARK,APPLE,ABC1,11111111),(B,PAUL,AMAZON,ABC2,11111111),(C,TIM,FIVN,ABC3,11111111),(D,LIN,MULESFT,ABC4,11111111),(E,YEP,UHG,ABC5,11111111),(F,QIN,ATT,ABC6,11111111)})
(22222222,{(A,MARK,APPLE,ABC6,22222222),(B,MARK,AMAZON,ABC7,22222222),(C,MARK,PQE,ABC8,22222222),(D,MARK,AMB,ABC9,22222222),(E,MARK,YZQ,ABC19,22222222),(F,MARK,PQR,,22222222)})
I have grouped the data with the key as above. I should generate the output by concatenating all the values of the tuple including nulls as below:
Output:
(1111111,A^B^C^D^E^F,MARK^PAUL^TIM^LIN^YEP^QIN,APPLE^AMAZON^FIVN^MULESFT^UHG^ATT,ABC1^ABC2^ABC3^ABC4^ABC5^^ABC6)
(2222222,A^B^^D^E^G,TIM^AIN^TIM^BIN^CIN^DIN^RIN,APPLE^AMAZON^PQE^AMB^YZQ^RIN,ABC6^ABC7^ABC8^ABC9^ABC19^^)
Can someone please help me?

Here is a code snippet that may help; adapt it to produce the expected output.
Input :
1,A
1,B
1,C
2,D
2,E
2,F
Output :
(1,C^B^A)
(2,F^E^D)
Pig Snippet :
data1 = load '/Users/muralirao/learning/pig/a.csv' using PigStorage(',') as (id:int, name:chararray);
req_data = FOREACH (GROUP data1 BY id) {
names = data1.name;
GENERATE group AS id, BagToString(names,'^');
};
DUMP req_data;
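
A bag has no guaranteed order, which is why the output above reads C^B^A. If the order of the concatenated values matters, one option is a nested ORDER inside the FOREACH. This is a minimal sketch that reuses the path and schema from the snippet above; the aliases `ordered_names`, `req_sorted` and `names_joined` are hypothetical, and how BagToString renders null fields has not been checked here.

```pig
data1 = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (id:int, name:chararray);
req_sorted = FOREACH (GROUP data1 BY id) {
    -- sort the grouped bag before projecting the column to concatenate
    ordered_names = ORDER data1 BY name;
    GENERATE group AS id, BagToString(ordered_names.name, '^') AS names_joined;
};
DUMP req_sorted;
```

To cover several columns, as in the question, the same BagToString call can be repeated once per column of the grouped bag.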

Convert the value of a column to uppercase in pig

I need to convert the value of a column to uppercase in Pig.
I was able to do it using UPPER, but this creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
d,e,f,E
But I need
a,B,c
d,E,f
Please let me know if there is a way to do so.

I have not tested this myself, but you can try:
B = Foreach A generate column1,UPPER(column2),column3;

Using the "*" in the below line is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead use the below:
B = Foreach A generate column1, UPPER(column2), column3;

You can also do it with the UPPER user-defined function that ships with Piggy Bank.
First, find the Piggy Bank jar with this command:
find / -name "piggybank*.jar*"
Then go to the Pig grunt shell and run:
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)

Pig : Adding new column to existing inner Tuple in Pig

I want to add a new column to an existing tuple column in Pig.
Example:
Input Schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray)}
Using a GENERATE statement, I want to add a new column to the tuple that holds the same value as length but under another name.
Output Schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray, len : int)}
I tried the approach below, but it's not working:
op = Foreach input_data generate
name,
attribute_list as attr : {(
height,
length,
size,
length as len)};
Please suggest.
Thanks in advance

Option 1:
Add a rank to each row, flatten the attribute_list bag, then recreate the bag with the additional column.
-- Rank the input relation (ip) using the RANK operator:
ranked= rank ip;
-- Flatten each tuple of the bag to row level (name is kept so it can be projected later):
a= foreach ranked generate rank_ip as id, name, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
b= group a by id;
op= foreach b generate flatten($1.name) as name, $1 as attr;
-- Note that name will also be part of the attr bag.
Option 2:
a. DataFu has a Pig UDF, BagConcat, for concatenating bags.
b. Define the BagConcat UDF:
define BagConcat datafu.pig.bags.BagConcat();
c. Flatten the elements:
a= foreach ip generate name, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
d. Reproject your bag:
op= foreach a generate name, BagConcat(height,len,size,len) as attr;

A = LOAD 'PATH' USING PigStorage() AS (ID:INT);
B = FOREACH A GENERATE *, null as len:int;
You can also give any integer value in place of null.
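
Another way to get the output schema from the question is a nested FOREACH over the bag, which rebuilds each inner tuple with the extra field. This is only a sketch: it assumes a Pig version that supports FOREACH inside a nested block, that `input_data` has the input schema shown in the question, and the alias `extended` is hypothetical.

```pig
-- input_data is assumed to have the schema from the question:
-- name: chararray,
-- attribute_list: {innertuple: (height: int, length: int, size: chararray)}
op = FOREACH input_data {
    -- rebuild each inner tuple, repeating length under the new name len
    extended = FOREACH attribute_list GENERATE height, length, size, length AS len;
    GENERATE name, extended AS attribute_list;
};
-- print the resulting schema to check that len is inside the inner tuple
DESCRIBE op;
```

Unlike a top-level `GENERATE *, ...`, this keeps the new field inside the bag's tuples rather than appending it as a separate column of the relation.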

How to make multiple rows from single row in pig for movie lens dataset

I want to split a single row into multiple rows on the basis of a field in Pig.
Example:
Consider one of the rows in the movie data set:
(31807, Dot the I (2003), Drama|Film-Noir|Thriller)
Each field is separated by ','.
The desired output is the following 3 rows:
31807,Dot the I (2003),Drama
31807,Dot the I (2003),Film-Noir
31807,Dot the I (2003),Thriller
Can anyone please help me get the desired output in Pig?

The logic below should help.
/* Input
(31807,Dot the I (2003),Drama|Film-Noir|Thriller)
*/
list = LOAD '/user/cloudera/movies.txt' USING PigStorage(',') AS (id:int, name:chararray, genres:chararray);
list_each = FOREACH list GENERATE id, name, FLATTEN(TOKENIZE(genres, '|'));
dump list_each;
/* Output
(31807,Dot the I (2003),Drama)
(31807,Dot the I (2003),Film-Noir)
(31807,Dot the I (2003),Thriller)
*/

PIG script for creating IDxCITY matrix from given csv file

I have an input file that contains ID, CITY and COUNT information as below, and I want to create a CSV file that lists, for each ID, the count for each CITY. The count should be written as '0' if the ID has no record for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN but couldn't make it produce this sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0

You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),
    FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),
    FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),
    FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),
    FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
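
When the list of cities is known up front, the same matrix can also be sketched with a conditional column per city followed by a SUM per ID, which avoids the filters and the COGROUP. The following is an untested sketch under stated assumptions: the count is loaded as an int under the hypothetical name `CNT`, the aliases `PER_CITY` and `MATRIX` and the output path `/tmp/id_city_matrix_sum` are made up for illustration, and the header line is not written.

```pig
-- CNT is loaded as int here (an assumption; the script above loads it as chararray)
DATA = LOAD '/tmp/data.csv' USING PigStorage(',') AS (ID:chararray, CITY:chararray, CNT:int);

-- one numeric column per known city: the count where the city matches, else 0
PER_CITY = FOREACH DATA GENERATE ID,
    (CITY == 'IZMIR'     ? CNT : 0) AS IZMIR,
    (CITY == 'ISTANBUL'  ? CNT : 0) AS ISTANBUL,
    (CITY == 'ESKISEHIR' ? CNT : 0) AS ESKISEHIR,
    (CITY == 'KOCAELI'   ? CNT : 0) AS KOCAELI;

-- collapse to one row per ID by summing each city column
MATRIX = FOREACH (GROUP PER_CITY BY ID) GENERATE group AS ID,
    SUM(PER_CITY.IZMIR)     AS IZMIR,
    SUM(PER_CITY.ISTANBUL)  AS ISTANBUL,
    SUM(PER_CITY.ESKISEHIR) AS ESKISEHIR,
    SUM(PER_CITY.KOCAELI)   AS KOCAELI;

STORE MATRIX INTO '/tmp/id_city_matrix_sum' USING PigStorage(',');
```

This variant also adds up counts if the same ID and CITY pair appears on more than one input line, which may or may not be what you want.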
