Apache Pig: strip namespace prefix (::) after group operation

A common pattern in my data processing is to group by some set of columns, apply a filter, then flatten again. For example:
my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = foreach my_data_grouped generate flatten(my_data);
The problem here is that if my_data starts with a schema like (c1, c2, c3), after this operation it will have a schema like (my_data::c1, my_data::c2, my_data::c3). Is there a way to easily strip off the "my_data::" prefix when the column names are unique?
I know I can do something like this:
my_data = foreach my_data generate c1 as c1, c2 as c2, c3 as c3;
However, that gets awkward and hard to maintain for data sets with many columns, and it is impossible for data sets with variable columns.

If all fields in a schema have the same prefix (e.g. group1::id, group1::amount, etc.), you can simply ignore the prefix when referencing specific fields (and just reference them as id, amount, etc.).
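For example, reusing the relations from the question, these two projections are equivalent as long as the short name c1 resolves to only one field:
just_c1 = foreach my_data generate my_data::c1;
just_c1 = foreach my_data generate c1; -- same field, prefix omitted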
Alternatively, if you're still looking to strip a single level of prefixing from a schema, you can use a UDF like this:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class RemoveGroupFromTupleSchema extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        // The data passes through untouched; only the schema is rewritten.
        return input;
    }

    @Override
    public Schema outputSchema(Schema input) throws FrontendException {
        if (input.size() != 1) {
            throw new RuntimeException("Expected input (tuple) but input does not have 1 field");
        }
        List<Schema.FieldSchema> inputSchema = input.getFields();
        List<Schema.FieldSchema> outputSchema = new ArrayList<Schema.FieldSchema>(inputSchema);
        for (int i = 0; i < inputSchema.size(); i++) {
            Schema.FieldSchema thisInputFieldSchema = inputSchema.get(i);
            String inputFieldName = thisInputFieldSchema.alias;
            byte dataType = thisInputFieldSchema.type;
            // Keep everything after the first "::"; leave unprefixed names alone.
            String outputFieldName;
            int findLoc = inputFieldName.indexOf("::");
            if (findLoc == -1) {
                outputFieldName = inputFieldName;
            } else {
                outputFieldName = inputFieldName.substring(findLoc + 2);
            }
            outputSchema.set(i, new Schema.FieldSchema(outputFieldName, dataType));
        }
        return new Schema(outputSchema);
    }
}
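A minimal usage sketch, with hedged assumptions: the class above is compiled into strip-prefix.jar (a hypothetical name), and the whole row is wrapped into a single tuple with the builtin TOTUPLE so the UDF receives exactly one field, as its outputSchema expects. Exact schema propagation can vary by Pig version, so treat this as a sketch rather than a drop-in recipe:
register 'strip-prefix.jar';
define StripPrefix RemoveGroupFromTupleSchema();
tmp = foreach my_data_grouped generate flatten(my_data); -- schema: my_data::c1, my_data::c2, my_data::c3
my_data = foreach tmp generate flatten(StripPrefix(TOTUPLE(*))); -- schema: c1, c2, c3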

You can put the AS clause on the same line as the FOREACH, i.e.
my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = FOREACH my_data_grouped GENERATE FLATTEN(my_data) AS (c1, c2, c3);
However, this is just the same as doing it on two lines, and it does not alleviate your issue with data sets that have variable columns.

Related

How to join and filter 2 relations by columns in Pig?

I'm a novice at Pig, and I'm trying to modify some existing Pig script to extract some data from log files.
E.g. I have 2 log files, one with a schema like:
message Class {
    message Student {
        optional int32 uid = 1;
        optional string name = 2;
    }
    optional int32 cid = 1;
    repeated Student students = 2;
}
After loading, I think a bag (say, bag1) is created (correct me if I'm wrong):
bag1:
{
(uid1, {(cid11, name11), (cid12, name12), (cid13, name13), ...}),
(uid2, {(cid21, name21), (cid22, name22), (cid23, name23), ...}),
...
}
The other log file is simpler; the resulting bag (bag2) looks like this.
bag2:
{
(name11),
(name13),
(name22),
...
}
What I want is to get all the rows from bag1 where any name from bag2 is contained inside the row, like:
result bag:
{
(uid1, (name11, name13)),
(uid2, (name22)),
}
I think I'll need to do some join/filter on these 2 bags, but I don't know how.
I tried a script snippet like the one below, but it's not even a valid script.
res = FOREACH bag1 {
    names = FOREACH students GENERATE name;
    xnames = JOIN names by name, bag2 by name;
    GENERATE cid, xnames;
};
FILTER res BY not IsEmpty(xnames);
So could anyone please give me some help with this script?
You won't be able to use JOIN inside a nested FOREACH. Instead, you can try flattening your bag and then joining it with the second relation:
bag1_flat = FOREACH bag1 GENERATE $0 AS uid, FLATTEN($1);
bag1_flat = FOREACH bag1_flat GENERATE uid, $2 AS name;
An inner join will then filter the rows:
bag12 = JOIN bag1_flat by name, bag2 by $0;
bag12 = FOREACH bag12 GENERATE bag1_flat::uid AS uid, bag1_flat::name AS name;
Finally, group by uid. You won't get tuples, though, as tuples cannot have different sizes; you'll get bags:
bag12_group = GROUP bag12 BY uid;
res = FOREACH bag12_group GENERATE group AS uid, bag12.name AS names;
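Given the sample data in the question, res then matches the desired result, except that the names arrive in bags rather than tuples:
(uid1,{(name11),(name13)})
(uid2,{(name22)})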

Need to implement ETL logic of generating sequence number through Pig Script

I have existing ETL logic that I have to implement in Pig.
The ETL logic creates a unique sequence number for a column whenever the incoming value is null or blank.
I need to do this through Pig.
You can generate a sequence number using RANK, but your condition is a little different: you assign a sequence number only when the value is 0 or null.
In my view, you should use a UDF for this:
package pig.test;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class SequenceNumber extends EvalFunc<Integer> {
    // Note: a static counter is only unique within a single task; it does
    // not give globally unique numbers across parallel map tasks.
    static int cnt = 0;

    public Integer exec(Tuple v) throws IOException {
        int a = (Integer) v.get(0);
        if (a == 0) {
            // Value was null/blank (mapped to 0 upstream): assign the next number.
            cnt++;
            return Integer.valueOf(cnt);
        } else {
            // Value present: pass it through unchanged.
            return Integer.valueOf(a);
        }
    }
}
In Pig:
-- Step 1: replace all nulls with 0
A1 = foreach A generate *, (id is null ? 0 : id) as sq;
-- Step 2: put the new sq column first, since the UDF reads field 0
T1 = foreach A1 generate sq, <your_fields>, <your_fields>;
-- Step 3: apply the UDF
Result = foreach T1 generate sqno(*), <your_fields>, <your_fields>;
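For completeness, a hedged sketch of how the UDF above would be wired in (the jar name sequence-udf.jar is hypothetical; sqno is just a Pig alias for the class):
register 'sequence-udf.jar';
define sqno pig.test.SequenceNumber();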
Hope this will help!!

How can I generate a schema from a text file? (Hadoop-Pig)

Somehow I ended up with a file, filename.log, which looks for example like this (tab-separated):
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
Because the keys in each row may differ, I cannot define a schema when I load the text, like:
a = load 'filename.log' as (Name:chararray,Age:int);
Nor do I want to reference columns by position, like:
b = foreach a generate $0,$1;
What I want, using only that filename.log, is to be able to reference each value by its key, for example:
a = load 'filename.log' using PigStorage('\t');
b = group a by Name;
c = foreach b generate group, COUNT(b);
dump c;
For that purpose, I wrote a Java UDF which separates each key:value pair and keeps the value for every field in the tuple, as below:
import java.io.IOException;
import java.util.ArrayList;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SPLITALLGETCOL2 extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        // Strip the surrounding parentheses from the tuple's string form,
        // then split the "key:value" pairs and keep only the values.
        String target = input.toString().substring(1, input.toString().length() - 1);
        String[] tokenized = target.split(",");
        try {
            for (int i = 0; i < tokenized.length; i++) {
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            return mTupleFactory.newTupleNoCopy(mProtoTuple);
        } catch (Exception e) {
            // On a malformed field, return whatever was parsed so far.
            return mTupleFactory.newTupleNoCopy(mProtoTuple);
        }
    }
}
How should I alter this method to get what I want? or How should I write other UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
@outputSchema('M:map[]')
def mapize(the_input):
    out = {}
    for kv in the_input.split(' '):
        k, v = kv.split(':')
        out[k] = v
    return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull values out of the map by the key you give. You can read more about maps in the Pig documentation.
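To finish the asker's original count-by-name example, the grouped relation from above can feed a COUNT (a sketch that just continues the question's own script):
D = FOREACH C GENERATE group, COUNT(B) ;
DUMP D ;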

how to assign a sequentially increasing number to a relation

I am wondering if there is a way to create a new field in a relation and assign some sequentially increasing number to it. Here is an example:
ordered_products = ORDER products BY price ASC;
ordered_products_with_sequential_id = FOREACH ordered_products GENERATE price, some_sequential_id;
How can I create some_sequential_id? I am not sure whether that's doable in Pig though.
I suspect you have to write your own UDF to get that working. One way to do it would be by incrementing a static variable (an AtomicLong here) in your UDF's exec implementation.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class IncrEval extends EvalFunc<Long> {
    final static AtomicLong res = new AtomicLong(0);

    @Override
    public Long exec(Tuple tip) throws IOException {
        if (tip == null || tip.size() == 0) {
            return null;
        }
        // incrementAndGet() is atomic; incrementing and then reading
        // the value separately would not be.
        return res.incrementAndGet();
    }
}
Pig script entry:
b = FOREACH a GENERATE <something>, com.eval.IncrEval() AS ID:long;
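As an aside, Pig 0.11 and later ship a built-in RANK operator that covers this use case without a UDF; a hedged one-liner alternative:
ranked = RANK ordered_products; -- prepends a sequential id to each row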

Pig: Get index in nested foreach

I have a pig script with code like:
scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
scoresGrouped = GROUP scores by id;
top10s = foreach scoresGrouped {
    sorted = order scores by score DESC;
    sorted10 = LIMIT sorted 10;
    GENERATE group as id, sorted10.scoreid as top10candidates;
};
It gets me rows like:
id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}
However, I wish to include the index of each item as well, so I'd have results like:
id1, {(scoreidA,1),(scoreidB,2),(scoreIdC,3)..(scoreIdFoo,10)}
Is it possible to include the index somehow in the nested foreach, or would I have to write my own UDF to add it in afterwards?
For indexing elements in a bag you may use the Enumerate UDF from LinkedIn's DataFu project:
register '/path_to_jar/datafu-0.0.4.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
scores = ...
...
result = foreach top10s generate id, Enumerate(top10candidates);
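With the question's sample data, result then contains exactly the indexed bags asked for, e.g.:
id1, {(scoreidA,1),(scoreidB,2),(scoreIdC,3)..(scoreIdFoo,10)}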
You'll need a UDF whose only argument is the sorted bag you want to add a rank to. I've had the same need before, so here's the exec function to save you a little time:
public DataBag exec(Tuple b) throws IOException {
    DataBag newBag = BagFactory.getInstance().newDefaultBag();
    try {
        DataBag bag = (DataBag) b.get(0);
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = (Tuple) it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                // Append the running index to the end of the tuple.
                t.append(n++);
            }
            newBag.add(t);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing item number in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
    return newBag;
}
(The counter n is initialized as a class variable outside the exec function, e.g. private long n = 1L; newBag is created at the top of exec so it starts empty for each call.)
You can also implement the Accumulator interface, which will allow you to do this even if your entire bag won't fit in memory. (The built-in COUNT function does this.) Make newBag a class field, set n = 1L; in the cleanup() method, and return newBag; in getValue(); everything else stays the same.
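For reference, a minimal sketch of that Accumulator variant, under stated assumptions: the class name EnumerateBag is illustrative, and n and newBag become the class fields described above.
import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class EnumerateBag extends EvalFunc<DataBag> implements Accumulator<DataBag> {
    private long n = 1L;
    private DataBag newBag = BagFactory.getInstance().newDefaultBag();

    @Override
    public void accumulate(Tuple b) throws IOException {
        // Pig calls this repeatedly with chunks of the bag, so the
        // whole bag never has to fit in memory at once.
        DataBag bag = (DataBag) b.get(0);
        for (Tuple t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                t.append(n++);
            }
            newBag.add(t);
        }
    }

    @Override
    public DataBag getValue() {
        return newBag;
    }

    @Override
    public void cleanup() {
        n = 1L;
        newBag = BagFactory.getInstance().newDefaultBag();
    }

    @Override
    public DataBag exec(Tuple b) throws IOException {
        // Non-accumulating path: run the whole input through accumulate.
        accumulate(b);
        DataBag result = getValue();
        cleanup();
        return result;
    }
}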