Need to implement ETL logic of generating sequence number through Pig Script - apache-pig

I have a current ETL Logic which I have to implement in pig.
ETL Logic is creating a unique sequence number for the column if incoming value is null or blank.
Need to do this through pig.

you can generate sequence number using RANK but in your condition it is little bit different you are checking if that value is either '0' or 'null' then only you are assigning sequence number..
My point of view you should use UDF for this..
package pig.test;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class SequenceNumber extends EvalFunc<Integer> {
static int cnt = 0;
public Integer exec(Tuple v) throws IOException{
int a = (Integer)v.get(0);
if(a == 0) {
cnt++ ;
return new Integer(cnt);
}
else
return new Integer(a);
}
}
In pig:
--Replace all null with 0
Step-1 A1 = foreach A generate *, (id is null ? 0 : id) as sq;
Step-2 T1 = foreach A1 generate sq,<your_fields>,<your_fields>;
Step-3 Result = foreach T1 generate sqno(*),<your_fields>,<your_fields>;
Hope this will help!!

Related

How can I generate schema from text file? (Hadoop-Pig)

Somehow i got filename.log which looks like for example (tab separated)
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
because the value of key column may differ i cannot define schema when i load text like
a = load 'filename.log' as (Name:chararray,Age:int);
Neither do i want to call column by position like
b = foreach a generate $0,$1;
What I want to do is, from only that filename.log, to make it possible to call each value by key, for example
a = load 'filename.log' using PigStorage('\t');
b = group b by Name;
c = foreach b generate group, COUNT(b);
dump c;
for that purpose, i wrote some Java UDF which seperate key:value and get value for every field in tuple as below
public class SPLITALLGETCOL2 extends EvalFunc<Tuple>{
#Override
public Tuple exec(Tuple input){
TupleFactory mTupleFactory = TupleFactory.getInstance();
ArrayList<String> mProtoTuple = new ArrayList<String>();
Tuple output;
String target=input.toString().substring(1, input.toString().length()-1);
String[] tokenized=target.split(",");
try{
for(int i=0;i<tokenized.length;i++){
mProtoTuple.add(tokenized[i].split(":")[1]);
}
output = mTupleFactory.newTupleNoCopy(mProtoTuple);
return output;
}catch(Exception e){
output = mTupleFactory.newTupleNoCopy(mProtoTuple);
return output;
}
}
}
How should I alter this method to get what I want? or How should I write other UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
#outputSchema('M:map[]')
def mapize(the_input):
out = {}
for kv in the_input.split(' '):
k, v = kv.split(':')
out[k] = v
return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull out all values from the map using the key you give. You can read more about maps here.

how to assign a sequentially increasing number to a relation

I am wondering if there is a way to create a new field in a relation and then assign some sequentially increasing number to it? Here is one example:
ordered_products = ORDER products BY price ASC;
ordered_products_with_sequential_id = FOREACH ordered_products GENERATE price, some_sequential_id;
How can I create some_sequential_id? I am not sure whether that's doable in Pig though.
I suspect you have to write your own UDF to get that running. One way to do it would be by incrementing a static variable (AtomicInteger) in your UDF in the exec implementation.
public class IncrEval extends EvalFunc<Long> {
final static AtomicLong res = new AtomicLong(0);
#Override
public Long exec (Tuple tip) throws IOException {
if (tip == null || tip.size() == 0) {
return null;
}
res.incrementAndGet();
return res.longValue();
}
}
Pig script entry:
b = FOREACH a GENERATE <something>, com.eval.IncrEval() AS ID:long;

Pig: Get index in nested foreach

I have a pig script with code like :
scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
scoresGrouped = GROUP scores by id;
top10s = foreach scoresGrouped{
sorted = order scores by score DESC;
sorted10 = LIMIT sorted 10;
GENERATE group as id, sorted10.scoreid as top10candidates;
};
It gets me a bag like
id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}
However, I wish to include the index of items as well, so I'd have results like
id1, {(scoreidA,1),(scoreidB,2),(scoreIdC,3)..(scoreIdFoo,10)}
Is it possible to include the index somehow in the nested foreach, or would I have to write my own UDF to add it in afterwards?
For indexing elements in a bag you may use the Enumerate UDF from LinkedIn's DataFu project:
register '/path_to_jar/datafu-0.0.4.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
scores = ...
...
result = foreach top10s generate id, Enumerate(top10candidates);
You'll need a UDF, whose only argument is the sorted bag you want to add a rank to. I've had the same need before. Here's the exec function to save you a little time:
public DataBag exec(Tuple b) throws IOException {
try {
DataBag bag = (DataBag) b.get(0);
Iterator<Tuple> it = bag.iterator();
while (it.hasNext()) {
Tuple t = (Tuple)it.next();
if (t != null && t.size() > 0 && t.get(0) != null) {
t.append(n++);
}
newBag.add(t);
}
} catch (ExecException ee) {
throw ee;
} catch (Exception e) {
int errCode = 2106;
String msg = "Error while computing item number in " + this.getClass().getSimpleName();
throw new ExecException(msg, errCode, PigException.BUG, e);
}
return newBag;
}
(The counter n is initialized as a class variable outside the exec function.)
You can also implement the Accumulator interface, which will allow you to do this even if your entire bag won't fit in memory. (The COUNT built-in function does this.) Just be sure to set n = 1L; in the cleanup() method and return newBag; in getValue(), and everything else is the same.

Can I pass parameters to UDFs in Pig script?

I am relatively new to PigScript. I would like to know if there is a way of passing parameters to Java UDFs in Pig?
Here is the scenario:
I have a log file which have different columns (each representing a Primary Key in another table). My task is to get the count of distinct primary key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? Like if I can pass a row number as parameter to UDF, it avoids the need for me writing multiple UDFs.
The way to do it is by using DEFINE and the constructor of the UDF. So here is an example of a customer "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.
To pass parameters you do the following in your pigscript:
UDF(document, '$param1', '$param2', '$param3')
edit: Not sure if those params need to be wrappedin ' ' or not
while in your UDF you do:
public class UDF extends EvalFunc<Boolean> {
public Boolean exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return false;
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
String var1 = input.get(1).toString();
InputStream var1In = fs.open(new Path(var1));
String var2 = input.get(2).toString();
InputStream var2In = fs.open(new Path(var2));
String var3 = input.get(3).toString();
InputStream var3In = fs.open(new Path(var3));
return doyourthing(input.get(0).toString());
}
}
for example
Yes, you can pass any parameter in the Tuple parameter input of your UDF:
exec(Tuple input)
and access it using
input.get(index)

Apache Pig: strip namespace prefix (::) after group operation

A common pattern in my data processing is to group by some set of columns, apply a filter, then flatten again. For example:
my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = foreach my_data_grouped flatten(my_data);
The problem here is that if my_data starts with a schema like (c1, c2, c3) after this operation it will have a schema like (mydata::c1, mydata::c2, mydata::c3). Is there a way to easily strip off the "mydata::" prefix if the columns are unique?
I know I can do something like this:
my_data = foreach my_data generate c1 as c1, c2 as c2, c3 as c3;
However that gets awkward and hard to maintain for data sets with lots of columns and is impossible for data sets with variable columns.
If all fields in a schema have the same set of prefixes (e.g. group1::id, group1::amount, etc) you can ignore the prefix when referencing specific fields (and just reference them as id, amount, etc)
Alternatively, if you're still looking to strip a schema of a single level of prefixing you can use a UDF like this:
public class RemoveGroupFromTupleSchema extends EvalFunc<Tuple> {
#Override
public Tuple exec(Tuple input) throws IOException {
Tuple result = input;
return result;
}
#Override
public Schema outputSchema(Schema input) throws FrontendException {
if(input.size() != 1) {
throw new RuntimeException("Expected input (tuple) but input does not have 1 field");
}
List<Schema.FieldSchema> inputSchema = input.getFields();
List<Schema.FieldSchema> outputSchema = new ArrayList<Schema.FieldSchema>(inputSchema);
for(int i = 0; i < inputSchema.size(); i++) {
Schema.FieldSchema thisInputFieldSchema = inputSchema.get(i);
String inputFieldName = thisInputFieldSchema.alias;
Byte dataType = thisInputFieldSchema.type;
String outputFieldName;
int findLoc = inputFieldName.indexOf("::");
if(findLoc == -1) {
outputFieldName = inputFieldName;
}
else {
outputFieldName = inputFieldName.substring(findLoc+2);
}
Schema.FieldSchema thisOutputFieldSchema = new Schema.FieldSchema(outputFieldName, dataType);
outputSchema.set(i, thisOutputFieldSchema);
}
return new Schema(outputSchema);
}
}
You can put the 'AS' statement on the same line as the 'foreach'.
i.e.
my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = FOREACH my_data_grouped FLATTEN(my_data) AS (c1, c2, c3);
However, this is just the same as doing it on 2 lines, and does not alleviate your issue for 'data sets with variable columns'.