Applying TRIM() in Pig for all fields in a tuple - apache-pig

I am loading a CSV file with 56 fields. I want to apply Pig's TRIM() function to every field in the tuple.
I tried:
B = FOREACH A GENERATE TRIM(*);
But it fails with the error below:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching
function for org.apache.pig.builtin.TRIM as multiple or none of them
fit. Please use an explicit cast.
Please help. Thank you.

TRIM expects a single chararray argument, so passing the whole tuple with * cannot be matched to any of its signatures. To trim the fields of a tuple in Pig, you can write your own UDF: register it and apply it in a FOREACH statement to each field you want to trim. Below is the code for a trimming UDF.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class StrTrim extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.trim();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
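A usage sketch (the jar name, package, and field positions are assumptions; an EvalFunc like this trims one field at a time, so with 56 fields every field has to be listed):
REGISTER strtrim.jar;                      -- assumed jar name
DEFINE StrTrim com.example.pig.StrTrim();  -- assumed package
A = LOAD 'data.csv' USING PigStorage(',');
B = FOREACH A GENERATE StrTrim((chararray)$0), StrTrim((chararray)$1); -- ... through $55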

Related

PIG UDF to convert tuple to multiple tuple output

I am new to Pig and I am trying to create a UDF which gets a tuple and returns multiple values based on a delimiter. I have written a UDF to read the data file below:
2012/01/01 Name1 Category1|Category2|Category3
2012/01/01 Name2 Category2|Category3
2012/01/01 Name3 Category1|Category5
Basically I am trying to read the $2 field
Category1|Category2|Category3
Category2|Category3
Category1|Category5
to get the output as:
Category1, Category2, Category3
Category2, Category3
Category1, Category5
Below is the UDF code I have written:
package com.test.multipleTuple;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleToMultipleTuple extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Collect every delimited token into an auxiliary tuple
        Tuple aux = TupleFactory.getInstance().newTuple();
        if (input == null || input.size() == 0)
            return null;
        try {
            String del = "\\|";
            String str = (String) input.get(0);
            String[] field = str.split(del);
            for (String nxt : field) {
                aux.append(nxt.trim());
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
        return aux.toDelimitedString(",");
    }
}
Created the jar: TupleToMultipleTuple.jar
But I am getting the below error while executing it:
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias B
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
at org.apache.pig.PigServer.openIterator(PigServer.java:892)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:884)
... 13 more
Can you please help me rectify the issue? Thanks.
Pig script for applying the UDF:
REGISTER TupleToMultipleTuple.jar;
DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
A = load 'data.txt' USING PigStorage(' ');
B = foreach A generate myFunc($2);
dump B;
You can use the built-in STRSPLIT function like this:
FLATTEN(STRSPLIT($2, '[|]', 3)) AS (cat1:chararray, cat2:chararray, cat3:chararray)
and you will get three fields named cat1, cat2 and cat3, typed as chararray and delimited by the current delimiter of the relation they belong to.
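For context, the complete statement might look like this (a sketch; the relation and positional field reuse the question's script):
B = foreach A generate flatten(STRSPLIT($2, '[|]', 3)) as (cat1:chararray, cat2:chararray, cat3:chararray);
dump B;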
Found the issue. The problem was in converting a DataByteArray to String; using toString() instead of a cast fixed it:
package com.test.multipleTuple;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleToMultipleTuple extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Collect every delimited token into an auxiliary tuple
        Tuple aux = TupleFactory.getInstance().newTuple();
        if (input == null || input.size() == 0)
            return null;
        try {
            String del = "\\|";
            // toString() works for both chararray and bytearray inputs
            String str = input.get(0).toString();
            String[] field = str.split(del);
            for (String nxt : field) {
                aux.append(nxt.trim());
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
        return aux.toDelimitedString(",");
    }
}
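As a side note (my own suggestion, not from the original thread): declaring a schema at load time makes $2 arrive as a chararray, which would have avoided the bytearray cast problem in the first place:
A = load 'data.txt' USING PigStorage(' ') AS (dt:chararray, name:chararray, categories:chararray);
B = foreach A generate myFunc(categories);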

Pig udf on Filter

I have a use case in which I need to take in a date and return the previous month's last date.
Ex: input: 20150331, output: 20150228
I will be using this previous month's last date to filter a daily partition (in a Pig script).
B = filter A by daily_partition == GetPrevMonth(20150331);
I have created a UDF (GetPrevMonth) which takes the date and returns the previous month's last date, but I am unable to use it in the filter.
ERROR:Could not infer the matching function for GetPrevMonth as multiple or none of them fit. Please use an explicit cast.
My UDF takes a tuple as input.
Googling suggests that a UDF cannot be applied in filters.
Is there any workaround, or am I going wrong somewhere?
UDF:
public class GetPrevMonth extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        String getdate = (String) input.get(0);
        if (getdate != null) {
            try {
                // LOGIC to return prev month date
            }
Need help. Thanks in advance.
You can call a UDF in a FILTER, but you are passing a number to the function while it expects to receive a String (a chararray inside Pig):
String getdate = (String) input.get(0);
The simple solution is to cast the value to chararray when calling the UDF:
B = filter A by daily_partition == GetPrevMonth((chararray)20150331);
(Quoting the literal, as in GetPrevMonth('20150331'), should work as well, since quoted literals are already chararrays.) Generally, when you see an error like Could not infer the matching function for X as multiple or none of them fit, 99% of the time the reason is that the types of the values you are passing to the UDF are wrong.
One last thing: even though it is not necessary here, in the future you might want to write a pure filter UDF. In that case, instead of extending EvalFunc, you extend FilterFunc and return a Boolean:
public class IsPrevMonth extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        try {
            String getdate = (String) input.get(0);
            if (getdate != null) {
                // LOGIC to retrieve prevMonthDate
                if (getdate.equals(prevMonthDate)) {
                    return true;
                } else {
                    return false;
                }
            } else {
                return false;
            }
        } catch (ExecException ee) {
            throw ee;
        }
    }
}
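A usage sketch for the filter UDF (the jar name is an assumption; the field name is taken from the question):
REGISTER getprevmonth.jar;  -- assumed jar name
B = filter A by IsPrevMonth((chararray)daily_partition);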

trouble creating a pig udf schema

Trying to parse XML, and I'm having trouble with my UDF returning a tuple. I am following the example from http://verboselogging.com/2010/03/31/writing-user-defined-functions-for-pig
Pig script:
titles = FOREACH programs GENERATE (px.pig.udf.PARSE_KEYWORDS(program))
AS (root_id:chararray, keyword:chararray);
Here is the output schema code (Scala):
override def outputSchema(input: Schema): Schema = {
  try {
    val s: Schema = new Schema
    s.add(new Schema.FieldSchema("root_id", DataType.CHARARRAY))
    s.add(new Schema.FieldSchema("keyword", DataType.CHARARRAY))
    return s
  } catch {
    case e: Exception => return null
  }
}
I'm getting this error:
pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException:
ERROR 0: Given UDF returns an improper Schema.
Schema should only contain one field of a Tuple, Bag, or a single type.
Returns: {root_id: chararray,keyword: chararray}
Update: final solution, in Java:
public Schema outputSchema(Schema input) {
    try {
        Schema tupleSchema = new Schema();
        tupleSchema.add(input.getField(1));
        tupleSchema.add(input.getField(0));
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.TUPLE));
    } catch (Exception e) {
        return null;
    }
}
You will need to wrap your s schema instance variable in another Schema object. Try returning new Schema(new FieldSchema(..., s, DataType.TUPLE)) like in the template below:
Here is my answer in Java (fill out your variable names):
@Override
public Schema outputSchema(Schema input) {
    Schema tupleSchema = new Schema();
    try {
        tupleSchema.add(new FieldSchema("root_id", DataType.CHARARRAY));
        tupleSchema.add(new FieldSchema("keyword", DataType.CHARARRAY));
        return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.TUPLE));
    } catch (FrontendException e) {
        e.printStackTrace();
        return null;
    }
}
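With the corrected outputSchema, the UDF now returns a single tuple field, so the call site can FLATTEN it instead of using the two-field AS clause (a sketch based on the question's script):
titles = FOREACH programs GENERATE FLATTEN(px.pig.udf.PARSE_KEYWORDS(program));
-- FLATTEN unnests the tuple, yielding top-level root_id and keyword fields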
Would you try:
titles = FOREACH programs GENERATE (px.pig.udf.PARSE_KEYWORDS(program));
If that doesn't error, then try:
titles = FOREACH titles GENERATE
    $0 AS root_id,
    $1 AS keyword;
And tell me the error?

Pig - passing Databag to UDF constructor

I have a script which is loading some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create a UDF whose constructor accepts the venues relation.
So I tried to define the UDF like this:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
And here is the actual UDF:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class GenerateVenues extends EvalFunc<Tuple> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;
    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()) {
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException("BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue : venues) {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS)) {
                    return mTupleFactory.newTuple(Collections.singletonList(venue));
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException("BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}
When executed, the script throws an error at the DEFINE statement, right before (venues);:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong; can you help me figure out what?
Is it that the UDF cannot accept the venues relation as a parameter? Or is the relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.
As @WinnieNicklaus already said, you can only pass strings to UDF constructors.
Having said that, the solution to your problem is the distributed cache: override public List<String> getCacheFiles() to return a list of filenames that will be made available via the distributed cache. With that, you can read the file as a local file and build your table.
The downside is that Pig has no initialization function, so you have to implement something like
private boolean initialized = false;

private void init() {
    if (!this.initialized) {
        // read table from the distributed-cache local file
        this.initialized = true;
    }
}
and then call that as the first thing from exec.
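Putting it together, a minimal sketch of the distributed-cache approach (the HDFS path, the symlink name, and the CSV layout are assumptions; the regex-matching logic from the original exec is elided):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GenerateVenues extends EvalFunc<Tuple> {
    private List<String> venues;
    private boolean initialized = false;

    @Override
    public List<String> getCacheFiles() {
        // "#venues" exposes the HDFS file to each task as a local
        // file named "venues" in the working directory (path assumed)
        return Arrays.asList("/user/hadoop/venues_extended_2.csv#venues");
    }

    private void init() throws IOException {
        if (!initialized) {
            venues = new ArrayList<String>();
            BufferedReader reader = new BufferedReader(new FileReader("./venues"));
            String line;
            while ((line = reader.readLine()) != null) {
                venues.add(line.split(",")[0]); // assumes first column is the venue name
            }
            reader.close();
            initialized = true;
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        init(); // lazy one-time load, since Pig has no initialization hook
        // ... match tuple.get(0) against venues as in the original exec ...
        return null;
    }
}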
You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.
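Combining both answers, the file location can still be supplied from the script as a constructor string and returned from getCacheFiles() (a sketch; the path argument is illustrative):
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues('/user/hadoop/venues_extended_2.csv');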

Passing custom parameters to a pig udf function in java

This is the way I am looking to process my data from Pig:
A = Load 'data' ...
B = FOREACH A GENERATE my.udfs.extract(*);
or
B = FOREACH A GENERATE my.udfs.extract('flag');
So basically extract either takes no arguments or takes an argument, 'flag'.
On my UDF side:
@Override
public DataBag exec(Tuple input) throws IOException {
    // if flag == true
    //     do this
    // else
    //     do that
}
Now how do I implement this in Pig?
The preferred way is to use DEFINE. Quoting the Pig documentation:
"Use DEFINE to specify a UDF function when:
...
The constructor for the function takes string parameters. If you need to use different constructor parameters for different calls to the function you will need to create multiple defines – one for each parameter set."
E.g., given the following UDF:
public class Extract extends EvalFunc<String> {
    private boolean flag;

    public Extract(String flag) {
        // Note that a boolean param cannot be passed from script/grunt,
        // therefore pass it as a string
        this.flag = Boolean.valueOf(flag);
    }

    public Extract() {
    }

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            if (flag) {
                ...
            } else {
                ...
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Then:
define ex_arg my.udfs.Extract('true');
define ex my.udfs.Extract();
...
B = foreach A generate ex_arg(*);  -- calls Extract with flag set to true
C = foreach A generate ex(*);      -- calls Extract without any flag set
Another option (a hack?):
In this case the UDF gets instantiated with its no-arg constructor, and you pass the flag you want to evaluate in its exec method. Since this method takes a tuple as its parameter, you need to first check whether the first field is the boolean flag.
public class Extract extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            boolean flag = false;
            if (input.getType(0) == DataType.BOOLEAN) {
                flag = (Boolean) input.get(0);
            }
            // process rest of the fields in the tuple
            if (flag) {
                ...
            } else {
                ...
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Then:
...
B = foreach A generate Extract(true, *);  -- use flag
C = foreach A generate Extract(*);        -- no flag; defaults to false
I'd rather stick with the first solution, though, as this one smells.