Can I pass parameters to UDFs in Pig script?

I am relatively new to Pig scripting. I would like to know if there is a way of passing parameters to Java UDFs in Pig.
Here is the scenario:
I have a log file with several columns (each representing a primary key in another table). My task is to get the count of distinct primary-key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? If I could pass the column number as a parameter to the UDF, it would avoid the need to write multiple UDFs.

The way to do it is by using DEFINE and the constructor of the UDF. Here is an example of a custom "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.
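For completeness, the Java side of that DEFINE might look roughly like this (a minimal sketch; the package and class name mirror the DEFINE above, everything else is an assumption):
package com.sample;

import java.io.IOException;
import java.util.Arrays;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MySplitter extends EvalFunc<Tuple> {
    private final String delimiter;

    // Pig invokes this constructor with the argument supplied in DEFINE.
    public MySplitter(String delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        // Split the field on the delimiter chosen at DEFINE time.
        String[] parts = input.get(0).toString().split(delimiter);
        return TupleFactory.getInstance().newTuple(Arrays.<Object>asList(parts));
    }
}
The same constructor trick works for any configuration value you want to fix once per script rather than pass on every call.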

To pass parameters you do the following in your Pig script:
UDF(document, '$param1', '$param2', '$param3')
edit: not sure whether those params need to be wrapped in ' ' or not. (The $param1 placeholders are filled in through Pig's parameter substitution, e.g. pig -param param1=/some/path script.pig.)
while in your UDF you do, for example:
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class UDF extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
        // Each parameter arrives as an ordinary field of the input tuple.
        String var1 = input.get(1).toString();
        InputStream var1In = fs.open(new Path(var1));
        String var2 = input.get(2).toString();
        InputStream var2In = fs.open(new Path(var2));
        String var3 = input.get(3).toString();
        InputStream var3In = fs.open(new Path(var3));
        return doyourthing(input.get(0).toString());
    }
}

Yes, you can pass any parameter in the Tuple input of your UDF:
exec(Tuple input)
and access it using
input.get(index)
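Tying this back to the original question, a minimal sketch (the class name ColumnPicker and its calling convention are assumptions for illustration): the column index is passed from the script as just another argument, so one UDF can serve every column.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ColumnPicker extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2)
            return null;
        // input.get(0) is the raw log line; input.get(1) is the column index
        // passed from the script, e.g. ColumnPicker(line, '2').
        String line = input.get(0).toString();
        int column = Integer.parseInt(input.get(1).toString());
        String[] fields = line.split("\t");
        return column < fields.length ? fields[column] : null;
    }
}
From there the script can run its usual DISTINCT and COUNT over the extracted column.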

Related

How to use lambda expressions in Java for nested if-else and for loop together

I have the following code, where I receive a list of names as a parameter. In the loop, I first assign the value at index 0 of the list to the local variable name, then compare the subsequent values from the list against name. If I receive any non-matching value from the list, I set a to 1 and fail the test case.
Below is the array list:
List<String> names= new ArrayList<String>();
names.add("John");
names.add("Mark");
Below is my Selenium test method:
public void test(List<String> names) {
    String name = null;
    int a = 0;
    for (String value : names) {
        if (name == null) {
            System.out.println("Value is null");
            name = value;
        } else if (name.equals(value)) {
            System.out.println("Received Same name");
            name = value;
        } else {
            a = 1;
            Assert.fail("Received different name in between");
        }
    }
}
How can I convert the above code into lambda expressions? I'm using the Cucumber data model, hence I receive the data as a list from the feature file. Since I can't give a clearer explanation, I have just posted the example logic I need to convert to a lambda expression.
Here's the solution: it cycles through all elements in your list, checking whether they are all the same.
You can try adding to or editing the list to get different outputs. I've written the logic; you can easily put it into a JUnit test:
List<String> names= new ArrayList<>();
names.add("John");
names.add("Mark");
String firstEntry = names.get(0);
boolean allMatch = names.stream().allMatch(name -> firstEntry.equals(name));
System.out.println("All names are the same: "+allMatch);
If you are looking for distinct values, i.e. whenever you see a non-matching value you set a = 1 and fail the assertion, you can achieve this with:
List<String> names = new ArrayList<String>();
names.add("John");
names.add("Mark");
if (names.stream().distinct().limit(2).count() > 1) {
    a = 1;
    Assert.fail("Received different name in between");
} else {
    System.out.println("Received Same name");
}

How to return the sum of a value in a table with a where clause in Grails 2.5.0

Domain class:
class Transaction {
String roundId
BigDecimal amount
:
}
The SQL we wish to execute is the following:
"select sum(t.amount) from transaction t where t.roundId = xxx"
We have been unable to find an example which does not return Transaction rows.
We assume there are two approaches:
Use projections and/or criteria etc? All the examples we have found only return lists of transaction rows, not the sum.
Use raw SQL. How do we call SQL, and get a handle on the BigDecimal it returns?
I tried this:
class bla {
    def sessionFactory

    def someMethod() {
        def SQLsession = sessionFactory.getCurrentSession()
        def results = SQLsession.createSQLQuery("select sum(t.credit) from transaction t where t.round_id = :roundId", [roundId: roundId])
But this fails with
groovy.lang.MissingMethodException: No signature of method: org.hibernate.internal.SessionImpl.createSQLQuery() is applicable for argument types: (java.lang.String, java.util.LinkedHashMap)
Also, I have no idea what the return type would be (can't find any documentation). I am guessing it will be a list of something: arrays? Maps?
==== UPDATE ====
Found one way which works (not very elegant or Grails-like):
def SQLsession = sessionFactory.getCurrentSession()
final query = "select sum(t.credit) from transaction t where t.round_id = :roundId"
final sqlQuery = SQLsession.createSQLQuery(query)
final results = sqlQuery.with {
    setString('roundId', roundId)
    list() // what is this for? Is there a better return value?
}
This seems to return an array, not a list as expected, so I can do this:
if (results?.size == 1) {
    println results[0] // outputs a big decimal
}
Strangely, results.length fails, but results.size works.
Using Criteria, you can do
Transaction.withCriteria {
    eq 'roundId', yourRoundIdValueHere
    projections {
        sum 'amount'
    }
}
withCriteria returns a list of the projected values, so here the sum is the single element of the result list (a BigDecimal, matching the amount property).
https://docs.jboss.org/hibernate/core/3.3/api/org/hibernate/classic/Session.html
Query createSQLQuery(String sql, String[] returnAliases, Class[] returnClasses)
Query createSQLQuery(String sql, String returnAlias, Class returnClass)
The second argument of createSQLQuery takes one or more returnAliases and is not meant for binding the statement to a value.
Instead of passing your values as the 2nd argument, use the setters of your Query object, i.e. setString, setInteger, etc. (results above is the Query that createSQLQuery returned):
results.setString('roundId', roundId);
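For reference, the same fix in plain Hibernate Java might look roughly like this (a sketch using org.hibernate.Session and org.hibernate.Query; uniqueResult() is the standard way to get a single aggregated value back instead of a one-element list, and the BigDecimal cast assumes the driver maps the DECIMAL sum that way):
static BigDecimal sumForRound(Session session, String roundId) {
    // Bind the named parameter on the Query object, not in createSQLQuery().
    Query query = session.createSQLQuery(
            "select sum(t.credit) from transaction t where t.round_id = :roundId");
    query.setString("roundId", roundId);
    return (BigDecimal) query.uniqueResult();
}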

Spark JavaPairRDD iteration

How can I iterate over a JavaPairRDD? I have done a groupBy and got back an RDD as below: a JavaPairRDD (a Tuple7 of Strings as the key, and a List of objects as the value).
Now I have to iterate over this RDD and do some calculations, like FOREACH in Pig.
Basically I would like to iterate over the key and the list of values, do some operations, and then return a JavaPairRDD.
JavaPairRDD<Tuple7<String, String, String, String, String, String, String>, List<Records>> sizes =
    piTagRecordData.groupBy(new Function<Records, Tuple7<String, String, String, String, String, String, String>>() {
        private static final long serialVersionUID = 2885738359644652208L;

        @Override
        public Tuple7<String, String, String, String, String, String, String> call(Records row) throws Exception {
            // Build the composite grouping key from the record's fields.
            return new Tuple7<String, String, String, String, String, String, String>(
                row.getAsset_attribute_id(), row.getDate_time_value(), row.getOperation(),
                row.getPi_tag_count(), row.getAsset_id(), row.getAttr_name(), row.getCalculation_type());
        }
    });
After this I want to perform, FOR EACH member of sizes (the JavaPairRDD), an operation like
rejected_records = FOREACH sizes GENERATE FLATTEN(Java function on the List of Records based on the group key)
I am using Spark 0.9.0.
Even though you are talking about "FOR EACH", it really sounds like you want the flatMap operation, since you want to produce new values and flatten them. This is available for Java RDDs, including a JavaPairRDD.
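A hedged sketch against the old Spark Java API (where FlatMapFunction.call returns an Iterable), reusing the sizes RDD and Records class from the question; processGroup is a hypothetical helper standing in for your Java function, and imports from org.apache.spark.api.java and scala.Tuple2/Tuple7 are assumed:
JavaRDD<Records> rejectedRecords = sizes.flatMap(
    new FlatMapFunction<Tuple2<Tuple7<String, String, String, String, String, String, String>, List<Records>>, Records>() {
        @Override
        public Iterable<Records> call(
                Tuple2<Tuple7<String, String, String, String, String, String, String>, List<Records>> group) throws Exception {
            // group._1() is the composite key, group._2() the records in that group;
            // return zero or more output records per group and flatMap flattens them.
            return processGroup(group._1(), group._2());
        }
    });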
You can use void foreach(VoidFunction<T> f) method. More info and methods: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/api/java/JavaRDDLike.html#foreach(org.apache.spark.api.java.function.VoidFunction)
If you want to view some values of a JavaPairRDD, I would do it like this:
for (Tuple2<String, String> test : pairRdd.take(10)) { // or pairRdd.collect()
    System.out.println(test._1());
    System.out.println(test._2());
}
Note: Tuple2 assumes you have strings inside the JavaPairRDD; change the type parameters according to the data types stored in the JavaPairRDD. From Java, the tuple fields are read through the _1() and _2() accessor methods.

How can I generate a schema from a text file? (Hadoop-Pig)

Somehow I got filename.log, which looks, for example, like this (tab separated):
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
Because the key columns may differ from row to row, I cannot define a schema when I load the text, like
a = load 'filename.log' as (Name:chararray,Age:int);
Neither do I want to refer to the columns by position, like
b = foreach a generate $0,$1;
What I want, from only that filename.log, is to make it possible to refer to each value by key, for example:
a = load 'filename.log' using PigStorage('\t');
b = group a by Name;
c = foreach b generate group, COUNT(a);
dump c;
For that purpose, I wrote a Java UDF which separates each key:value pair and gets the value for every field in the tuple, as below:
import java.util.ArrayList;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SPLITALLGETCOL2 extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        Tuple output;
        // Strip the surrounding parentheses from the tuple's string form.
        String target = input.toString().substring(1, input.toString().length() - 1);
        String[] tokenized = target.split(",");
        try {
            // Keep only the value part of every key:value field.
            for (int i = 0; i < tokenized.length; i++) {
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        } catch (Exception e) {
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        }
    }
}
How should I alter this method to get what I want? Or how should I write another UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
@outputSchema('M:map[]')
def mapize(the_input):
    out = {}
    # the log is tab separated, so split the line on tabs
    for kv in the_input.split('\t'):
        k, v = kv.split(':')
        out[k] = v
    return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull out all values from the map using the key you give. You can read more about maps here.
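If you would rather stay in Java, a rough equivalent of the Jython UDF above (a minimal sketch; the class name Mapize is made up, and an EvalFunc is allowed to return a Map, which Pig treats as a map field):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Mapize extends EvalFunc<Map<String, String>> {
    @Override
    public Map<String, String> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        // Turn "Name:Peter<TAB>Age:18" style lines into key/value pairs.
        Map<String, String> out = new HashMap<String, String>();
        for (String kv : input.get(0).toString().split("\t")) {
            String[] parts = kv.split(":", 2);
            if (parts.length == 2)
                out.put(parts[0], parts[1]);
        }
        return out;
    }
}
Registered from a jar instead of via jython, it can be used with the same GROUP B BY M#'Name' pattern.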

How to assign a sequentially increasing number to a relation

I am wondering if there is a way to create a new field in a relation and then assign some sequentially increasing number to it? Here is one example:
ordered_products = ORDER products BY price ASC;
ordered_products_with_sequential_id = FOREACH ordered_products GENERATE price, some_sequential_id;
How can I create some_sequential_id? I am not sure whether that's doable in Pig though.
I suspect you have to write your own UDF to get that working. One way to do it would be to increment a static variable (an AtomicLong) in your UDF's exec implementation.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class IncrEval extends EvalFunc<Long> {
    // Static state is per JVM, so the sequence restarts in each task.
    final static AtomicLong res = new AtomicLong(0);

    @Override
    public Long exec(Tuple tip) throws IOException {
        if (tip == null || tip.size() == 0) {
            return null;
        }
        return res.incrementAndGet();
    }
}
Pig script entry:
b = FOREACH a GENERATE <something>, com.eval.IncrEval() AS ID:long;