Change the order of tuple fields - apache-pig

As an example, lets say I load two different files into a pig script
A = LOAD 'file1' USING PigStorage('\t') AS (
day:chararray,
month:chararray,
year:chararray,
message:chararray);
B = LOAD 'file2' USING PigStorage('\t) AS (
month:chararray,
day:chararray,
year:chararry,
message:chararray);
Now, notice the order of the fields is different, so if I combine them into one file C = UNION A, B; I get...
(2,OCT,2013,INFO INVALID USERNAME)
(OCT,3,2013,WARN STACK OVERFLOW)
If for no other reason than to make the data easier to read, I'd like to reorder the fields, so that both of them follow a common format and have the same positional notation for each field.
(2,OCT,2013,INFO INVALID USERNAME)
(3,OCT,2013,WARN STACK OVERFLOW)
This also crops up in a few other places with messages, levels, hosts, etc. It's not just date fields, I'd like to make everything "prettier" all around.
In some weird pseudo-code, I'd be looking for something like:
D = FOREACH B
REORDER (month,day,year) TO (day,month,year);
I haven't been able to find an example of anyone trying to do this and don't see a function that would do it. So maybe it's not possible and I'm alone here, but if anyone has any ideas I'd appreciate some hints.

In general, this is not necessary in Pig because you can just refer to fields by name and not worry about their position in the record. If your goal is to do a UNION of the two relations, you can achieve this by using the ONSCHEMA keyword:
C = UNION ONSCHEMA A, B;
That said, if you do really need to reorder a relation, a simple FOREACH...GENERATE is all you need:
D = FOREACH B GENERATE day, month, year, message;
Note that in your example, you are not actually working with tuples, you are working with entire records. If you did have a tuple, though, you can use the TOTUPLE built-in UDF to get where you need to go:
DESCRIBE E;
E: {t: (month: chararray,day: chararray,year: chararray,message: chararray)}
F = FOREACH E GENERATE TOTUPLE(t.day, t.month, t.year, t.message) AS t;
DESCRIBE F;
F: {t: (day: chararray,month: chararray,year: chararray,message: chararray)}

Related

PIG filter out rows with improper number of columns

I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows with number of columns/fields different than 3. How to do it?
In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.
Thanks!
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 != 3;
result = foreach filtered generate $1..;
The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.
I see two ways to do this:
First, you can rephrase the filter I think, as it boils down to: Give me all lines that do not contain an NULL value. For lots of columns, writing this filter statement is rather tedious.
Second, you could convert your columns into a bag per line, using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag) and then write a UDF that processes the input bag to check for null tuples in this bag and return true or false and use this in the filter statement.
Either way, some tediousness is required I think.

Sending relation to UDF functions

Can I Send a relation to Pig UDF function as input? A relation can have multiple tuples in it. How do we read each tuple one by one in Pig UDF function?
Ok.Below is my Sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All i need is details about highest paid employee in each company. How do i use UDF for that ?
I need something like this
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can SomeOne Help me on this.
This script will give you the results you want :
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company.So A is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then we say foreach tuple in B , generate another tuple result which is equal to the top 1 record from the relation A found in B on the basis of value of column number 2 i.e. amt. The columns are numbered from 0.
Note
First your data has extra spaces after company name. Please remove the extra spaces or use the following data :
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
mit,SBI,70000,CTS
You don't need to write an UDF to do this, you can simply do it with the top function from pig : http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work ( not tested) :
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE company, FLATTEN(TOP(1,2,grouped));

Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
It seems like I should be able to do this by grouping the fixed fields and flattening a bag of the KV Pairs to generate the cross product
I've tried reading this in with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates records with a schema as follows:
A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}
While this gets me the schema that I want, there are several problems that I can't seem to solve:
By using the TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all or nothing query.
The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in pig. If I dump some data from statement B, I definitely see data there that would match those expressions.
So I've tried approaching the problem differently, by loading the data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 4 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach mc_bk_data generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain, each of the individual kvpairs as a tuple of one field so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
I'm open to other ways of attacking this problem as well, but I can seem to find good way to transform my tuple of multiple fields into a bag of tuples, with each tuple having one field each.
I'm using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.
Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.
You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.
The UDF is pretty simple. In Java, you can just do something like this in your exec method:
DataBag b = new DefaultDataBag();
Tuple t = (Tuple) input.get(0);
for (int i = 0; i < t.size(); i++) {
Object o = t.get(i);
Tuple e = TupleFactory.getInstance().createTuple(o);
b.add(e);
}
return b;
Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.

How can I split a Pig tuple into subtuples?

Here are the tuples that I am manipulating in a Pig script:
DUMP A
(4,20,53,31)
(21,3,40,16)
(15,51,12,3)
I would transform this relation (A) into another relation (B) such as:
DUMP B
(4,20)
(4,53)
(4,31)
(21,3)
(21,40)
(21,16)
(15,51)
(15,12)
(15,3)
That means keeping the first field in all tuples and get one tuple by each field. In the previous example, each tuple must give 3 new tuples. I have a solution to get:
DUMP B
(4,20)
(21,3)
(15,51)
(4,53)
(21,40)
(15,12)
(4,31)
(21,16)
(15,3)
Which is the good result but with the wrong order (I use the FOREACH operator each time). I could get the right order by adding a field to each tuple and then using the ORDER operator but I think there is a simpler way to do so.
Any idea?
Thank you.
You could do:
-- T is the name of the tuple, and v[1-4] are the values in the tuple
B = FOREACH A GENERATE T.v1, FLATTEN(TOBAG(T.v2, T.v3, T.v4)) ;
If the values do not have names, you could also do:
B = FOREACH A GENERATE T.$0, FLATTEN(TOBAG(T.$1, T.$2, T.$3)) ;
Output:
(4,20)
(4,53)
(4,31)
(21,3)
(21,40)
(21,16)
(15,51)
(15,12)
(15,3)

Apache Pig load entire relationship into UDF

I have a pig script that pertains to 2 Pig relations, lets say A and B. A is a small relationship, and B is a big one. My UDF should load all of A into memory on each machine and then use it while processing B. Currently I do it like this.
A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);
I then have every machine load from 'templocation' to get A.This works, but I have two problems with it.
My understanding is I should be using the HDFS cache somehow, but I'm not sure how to load a relationship directly into the HDFS cache.
When I reload the file in my UDF I got to write logic to parse the output from A that was outputted to file when I'd rather be directly using bags and tuples (is there a built in Pig java function to parse Strings back into Bag/Tuple form?).
Does anyone know how it should be done?
Here's a trick that will work for you.
You do a GROUP ALL on A first which "bags" all data in A into one field. Then artificially add a common field on both A and B and join them. This way, foreach tuple in the enhanced B, you will have the full data of A for your UDF to use.
It's like this:
(say originally in A, you have fields fa1, fa2, fa3, in B you have fb1, fb2)
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
A_aux = FOREACH A GENERATE 'xx' AS join_key, $1;
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, A_all);
since this is replicated join, it's also only map-side join.