Pig: Adding a new column to an existing inner tuple in Pig

I want to add a new column to an existing tuple column in Pig.
Example:
Input Schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray)}
Using a GENERATE statement, I want to add a new column to the tuple that holds the same value as length but under a different name.
Output Schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray, len : int)}
I tried the approach below, but it's not working:
op = Foreach input_data generate
name,
attribute_list as attr : {(
height,
length,
size,
length as len)};
Please suggest.
Thanks in advance

Option 1:
Add a rank to each row, flatten the attribute_list bag, then rebuild the bag with the additional column.
-- Rank the input relation (ip); RANK prepends a field named rank_ip
ranked = rank ip;
-- Flatten the bag so that each inner tuple becomes its own row
a = foreach ranked generate rank_ip as id, name, flatten(attribute_list);
-- Add the new column as a copy of length
a2 = foreach a generate id, name, height, length, size, length as len;
-- Regroup per original row and project the bag without id and name
b = group a2 by (id, name);
op = foreach b generate group.name as name, a2.(height, length, size, len) as attr;
Option 2:
a. DataFu provides a Pig UDF, BagConcat, that concatenates multiple bags into one.
b. Define the BagConcat UDF:
define BagConcat datafu.pig.bags.BagConcat();
c. Flatten the elements:
a= foreach ip generate name, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
d. Reproject your bag:
op= foreach a generate name, BagConcat(height,len,size,len) as attr;
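A nested FOREACH may also do this without flattening and regrouping. The following is only a sketch: it assumes the input relation is named ip as in the answer above, that the bag has the schema given in the question, and that the Pig version in use allows a nested FOREACH inside a FOREACH block. The aliases extended and op are illustrative names.

```pig
-- Sketch: rebuild each inner tuple with an extra column.
-- Assumes ip has schema (name: chararray,
--   attribute_list: {(height: int, length: int, size: chararray)}).
op = FOREACH ip {
    -- Project every tuple of the bag and append a copy of length as len
    extended = FOREACH attribute_list GENERATE height, length, size, length AS len;
    GENERATE name, extended AS attribute_list;
};
DESCRIBE op;
```

Because the bag is never flattened, rows with an empty attribute_list are kept and no RANK or GROUP step is needed.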

A = LOAD 'PATH' USING PigStorage() AS (ID:INT);
B = FOREACH A GENERATE *, null as len:int;
You can also use any integer value in place of null.

Related

How to read a list of values in Pig as a bag and compare it to a specific value?

Input:
ids:
1111,2222,3333,4444
employee:
{"name":"abc","id":"1111"} {"name":"xyz","id":"10"}
{"name":"z","id":"100"} {"name":"m","id":"99"}
{"name":"pqr","id":"3333"}
I want to filter out employees whose id exists in the given list, keeping only those whose id is not in it.
Expected Output:
{"name":"xyz","id":"10"} {"name":"z","id":"100"}
{"name":"m","id":"99"}
Existing Code:
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
empl = LOAD 'pathToFile' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (data:map[]);
output = FILTER empl BY data#'id' in (idList);
-- not working, states: A column needs to be projected from a relation for it to be used as a scalar
output = FILTER empl BY data#'id' in (idList#id);
-- not working, states: mismatched input 'id' expecting set null
JsonLoader is built into Pig (0.10 and later), and you can specify the schema:
empl = LOAD 'pathToFile' USING JsonLoader('name:chararray, id:chararray');
DUMP empl;
(abc,1111)
(xyz,10)
(z,100)
(m,99)
(pqr,3333)
You're loading idList as a one-column table of type chararray, but you want a list.
Loading it as a one-column table (this implies modifying your file so there is only one record per line):
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Or, for a one-line file, change the separator so the line isn't split into columns (otherwise only the first column would be loaded):
idList = LOAD 'pathToFile' USING PigStorage(' ') AS (id:chararray);
idList = FOREACH idList GENERATE FLATTEN(TOKENIZE(id, '[,]')) AS id;
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Now we can do a LEFT JOIN to see which ids are not present in idList, and then a FILTER to keep only those. Note that output is a reserved keyword, so you shouldn't use it as an alias:
res = JOIN empl BY id LEFT, idList BY id;
res = FILTER res BY idList::id IS NULL;
DUMP res;
(xyz,10,)
(m,99,)
(z,100,)

Convert the value of a column to uppercase in pig

I need to convert the value of a column to uppercase in pig.
I was able to do this using UPPER, but that creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
d,e,f,E
But I need
a,B,c
d,E,f
Please let me know if there is a way to do so.
I haven't tried this myself, but you can try this:
B = Foreach A generate column1,UPPER(column2),column3;
Using "*" in the line below is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead, use the following:
B = Foreach A generate column1, UPPER(column2), column3;
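If the result should also keep the original field name, an AS clause can be added to the converted column. A minimal sketch, assuming the relation A loaded as in the question:

```pig
-- Replace column2 in place and keep its name in the output schema
B = FOREACH A GENERATE column1, UPPER(column2) AS column2, column3;
DESCRIBE B;
```

Without the AS clause the second field is left unnamed and can only be referenced by position ($1) in later statements.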
You can also do it with a user-defined function that ships with Apache Pig in Piggybank.
First, find the Piggybank jar with this command:
find / -name "piggybank*.jar*"
Then go to the Pig grunt shell and run:
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)

Pig one row to multiple row

Can you please provide a Pig script for the query below?
Here's the input format.
Input
ID, Label
122,a|b
215,q|b|c
214,Z|b|c
218,w|b|c
211,r|b|c
219,u|b
Expected output
122,a
122,b
215,q
215,b
215,c
214,Z
214,b
214,c
218,w
218,b
218,c
...........
Thanks,
Abhi
TOKENIZE the Label, which gives a bag, and then FLATTEN it, which gives as many rows as there are tuples in the bag. Sample code:
inpt = LOAD '....' USING PigStorage(',') AS (ID: chararray, Label : chararray);
result = FOREACH inpt GENERATE ID, FLATTEN(TOKENIZE(Label, '|'));
DUMP result;

Error 1045 on sum function in pig latin with an int

The following pig latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size); --
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if I describe groupedIp, I get the following schema:
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}} which indicates that size is an int, and should be able to be used by the sum function.
Am I calling the sum function incorrectly? Let me know if you would like to see any thing else, such as the input file.
SUM operates on a bag as input, but you pass it the field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
Do some dumps of your data. I'll bet some of the stuff in the size column is non-integer, and Pig runs into that and dies. You could also code up your own isInteger udf to check this before the rest of your processing, and throw out any that aren't integers.
SUM, AVG and COUNT always operate on a bag, so group the data first and then apply the function to the grouped bag, as below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);
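The same pattern extends to several aggregates in one statement. This is a sketch only: it reuses the relations A and G from the snippet above, and the alias stats and the output field names are illustrative.

```pig
-- Each aggregate receives a bag: either a projected column (A.volume)
-- or the whole grouped bag (A).
stats = FOREACH G GENERATE
    group          AS symbol,
    SUM(A.volume)  AS total_volume,
    AVG(A.close)   AS avg_close,
    COUNT(A)       AS num_rows;
```

Projecting A.volume inside the FOREACH over the grouped relation yields a bag of that one column, which is the input type SUM and AVG expect.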

How do I specify a field name for a wrapped tuple in Pig?

I have a tuple with schema (a:int, b:int, c:int) stored in alias first. I want to convert each tuple into a new relation, second, with a schema like this:
(d: (a:int, b:int, c:int))
Basically, I've wrapped my initial tuple in another tuple and named the field. This is in preparation for a cross operation where I want to cross two relations but keep each one in a named field.
Here is what I would expect it to look like, except there's an error:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple);
This errors out too:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple (a:int, b:int, c:int));
Thanks!
Uri
What about:
second = FOREACH first GENERATE TOTUPLE(*) AS d;
describe second;
second: {d: (a: int,b: int,c: int)}
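For the cross mentioned in the question, the same wrapping could be applied to both sides before crossing. A sketch under assumptions: other is a hypothetical second relation, and first_wrapped, other_wrapped and crossed are illustrative aliases.

```pig
-- Wrap each relation's fields in a single named tuple field
first_wrapped = FOREACH first GENERATE TOTUPLE(*) AS d;
other_wrapped = FOREACH other GENERATE TOTUPLE(*) AS e;  -- other is hypothetical

-- Each output row then carries one tuple from each side
crossed = CROSS first_wrapped, other_wrapped;
DESCRIBE crossed;
```

Using different field names (d and e) keeps the two sides easy to tell apart after the CROSS.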