I am new to Pig and I am trying to create a UDF which gets a tuple and returns multiple tuples based on a delimiter. I have written the UDF below to read the following data file:
2012/01/01 Name1 Category1|Category2|Category3
2012/01/01 Name2 Category2|Category3
2012/01/01 Name3 Category1|Category5
Basically I am trying to read the $2 field
Category1|Category2|Category3
Category2|Category3
Category1|Category5
to get the output as:
Category1, Category2, Category3
Category2, Category3
Category1, Category5
Below is the UDF code I have written:
package com.test.multipleTuple;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class TupleToMultipleTuple extends EvalFunc<String> {
@Override
public String exec(Tuple input) throws IOException {
// Collect each split value into an auxiliary tuple
Tuple aux = TupleFactory.getInstance().newTuple();
if (input == null || input.size() == 0)
return null;
try {
String del = "\\|";
String str = (String) input.get(0);
String field[] = str.split(del);
for (String nxt : field) {
aux.append(nxt.trim().toString());
}
} catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
return aux.toDelimitedString(",");
}
}
Created the jar --> TupleToMultipleTuple.jar
But I am getting the below error while executing it.
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias B
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
at org.apache.pig.PigServer.openIterator(PigServer.java:892)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:884)
... 13 more
Can you please help me rectify the issue? Thanks.
Pig script for applying the UDF:
REGISTER TupleToMultipleTuple.jar;
DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
A = load 'data.txt' USING PigStorage(' ');
B = foreach A generate myFunc($2);
dump B;
You can use the built-in STRSPLIT function like this:
FLATTEN(STRSPLIT($2, '[|]', 3)) AS (cat1:chararray, cat2:chararray, cat3:chararray)
and you will get three fields named cat1, cat2 and cat3, typed as chararray and delimited by the current delimiter of the relation they belong to.
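For reference, a minimal end-to-end sketch of that approach applied to the sample data above (the field names and the 3-way split limit are assumptions, not part of the original script):

A = LOAD 'data.txt' USING PigStorage(' ') AS (dt:chararray, name:chararray, categories:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(categories, '[|]', 3)) AS (cat1:chararray, cat2:chararray, cat3:chararray);
DUMP B;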
Found the issue. The problem was in converting the DataByteArray to a String; using toString() fixed it:
package com.test.multipleTuple;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class TupleToMultipleTuple extends EvalFunc<String> {
@Override
public String exec(Tuple input) throws IOException {
// Collect each split value into an auxiliary tuple
Tuple aux = TupleFactory.getInstance().newTuple();
if (input == null || input.size() == 0)
return null;
try {
String del = "\\|";
String str = input.get(0).toString();
String field[] = str.split(del);
for (String nxt : field) {
aux.append(nxt.trim().toString());
}
} catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
return aux.toDelimitedString(",");
}
}
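A related point: if the LOAD statement declares explicit field types, Pig passes the UDF a chararray instead of a DataByteArray, so the original cast would also have worked. A sketch of that variant (the field names are made up):

A = load 'data.txt' USING PigStorage(' ') AS (dt:chararray, name:chararray, categories:chararray);
B = foreach A generate myFunc(categories);
dump B;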
Related
I am using this helper function in my Android app, which I got from here, basically to convert a ReadableMap object to a WritableMap object. For reasons I can't tell, it raises an error on a specific field:
import com.facebook.react.bridge.ReadableArray;
import com.facebook.react.bridge.ReadableMap;
import com.facebook.react.bridge.ReadableMapKeySetIterator;
import com.facebook.react.bridge.ReadableNativeArray;
import com.facebook.react.bridge.WritableMap;
import com.facebook.react.bridge.WritableNativeMap;
import com.facebook.react.bridge.Arguments;
import java.util.Map;
public static WritableMap convertReadableMapToWriteableMap(ReadableMap readableMap) {
WritableMap map = new WritableNativeMap();
ReadableMapKeySetIterator iterator = readableMap.keySetIterator();
while (iterator.hasNextKey()) {
String key = iterator.nextKey();
switch (readableMap.getType(key)) {
case String:
break;
case Array:
logger.log(Level.INFO, "Key info= " + key + "=>" + readableMap.getArray(key));
//fails here
map.putArray(key, readableMap.getArray(key));
break;
}
}
return map;
}
Logging the column like this gives:
Key info= collection => [{"checkedOut":false,"choice":[40226],"collectionGroupId":1,"maxLimit":10}]
and for some documents, it is just [] (empty array)
The exact error message I am getting is
null
Illegal type provided
Is the field not an array? Looking at the putArray method, I see:
@Override
public void putArray(@NonNull String key, @Nullable ReadableArray value) {
Assertions.assertCondition(
value == null || value instanceof WritableNativeArray, "Illegal type provided");
putNativeArray(key, (WritableNativeArray) value);
}
The object is neither null nor a WritableNativeArray.
Just in case anyone comes across this, I ended up doing this (not sure if it is the right way):
map.merge(readableMap)
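If merge does not fit your case, an explicit deep copy should also satisfy the assertion, since it rejects anything that is not a WritableNativeArray. Below is a rough, untested sketch using the standard ReadableArray/WritableArray bridge methods; the class and helper names are made up:

import com.facebook.react.bridge.ReadableArray;
import com.facebook.react.bridge.ReadableMap;
import com.facebook.react.bridge.ReadableMapKeySetIterator;
import com.facebook.react.bridge.WritableArray;
import com.facebook.react.bridge.WritableMap;
import com.facebook.react.bridge.WritableNativeArray;
import com.facebook.react.bridge.WritableNativeMap;

public class MapCopy {

    // Recursively copy a ReadableArray into a new WritableNativeArray so that
    // putArray()'s "instanceof WritableNativeArray" assertion is satisfied.
    public static WritableArray deepCopyArray(ReadableArray source) {
        WritableArray target = new WritableNativeArray();
        for (int i = 0; i < source.size(); i++) {
            switch (source.getType(i)) {
                case Null:
                    target.pushNull();
                    break;
                case Boolean:
                    target.pushBoolean(source.getBoolean(i));
                    break;
                case Number:
                    target.pushDouble(source.getDouble(i));
                    break;
                case String:
                    target.pushString(source.getString(i));
                    break;
                case Map:
                    target.pushMap(deepCopyMap(source.getMap(i)));
                    break;
                case Array:
                    target.pushArray(deepCopyArray(source.getArray(i)));
                    break;
            }
        }
        return target;
    }

    // Same idea for maps, so nested objects inside the array are copied as well.
    public static WritableMap deepCopyMap(ReadableMap source) {
        WritableMap target = new WritableNativeMap();
        ReadableMapKeySetIterator iterator = source.keySetIterator();
        while (iterator.hasNextKey()) {
            String key = iterator.nextKey();
            switch (source.getType(key)) {
                case Null:
                    target.putNull(key);
                    break;
                case Boolean:
                    target.putBoolean(key, source.getBoolean(key));
                    break;
                case Number:
                    target.putDouble(key, source.getDouble(key));
                    break;
                case String:
                    target.putString(key, source.getString(key));
                    break;
                case Map:
                    target.putMap(key, deepCopyMap(source.getMap(key)));
                    break;
                case Array:
                    target.putArray(key, deepCopyArray(source.getArray(key)));
                    break;
            }
        }
        return target;
    }
}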
I am loading a CSV file with 56 fields. I want to apply the TRIM() function in Pig to all fields in the tuple.
I tried:
B = FOREACH A GENERATE TRIM(*);
But it fails with the below error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching
function for org.apache.pig.builtin.TRIM as multiple or none of them
fit. Please use an explicit cast.
Please help. Thank you.
To trim the fields of a tuple in Pig, you can create a UDF. Register the UDF and apply it in a FOREACH statement to each field of the tuple you want to trim. Below is the code for a trimming UDF:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;
public class StrTrim extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try {
String str = (String)input.get(0);
return str.trim();
}
catch(Exception e) {
throw WrappedIOException.wrap("Caught exception processing input row ", e);
}
}
}
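A usage sketch (the jar name, package, and field names are placeholders); because TRIM-style functions take a single chararray argument, with 56 fields you would have to list each field explicitly or generate the script:

REGISTER strtrim.jar;
DEFINE StrTrim com.example.pig.StrTrim();
B = FOREACH A GENERATE StrTrim(f1), StrTrim(f2), StrTrim(f3);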
Hi, I am trying to run this HCatalog example from the following link:
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/CDH4-Installation-Guide/cdh4ig_topic_19_6.html
I am getting the following exception when I run the job:
java.lang.IllegalArgumentException: URI: does not have a scheme
Java class:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.common.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.hadoop.util.GenericOptionsParser;
//import org.apache.commons.cli.Options;
public class UseHCat extends Configured implements Tool {
public static class Map extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
String groupname;
@Override
protected void map( WritableComparable key,
HCatRecord value,
org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// The group table from /etc/group has name, 'x', id
groupname = (String) value.get(0);
int id = (Integer) value.get(2);
// Just select and emit the name and ID
context.write(new Text(groupname), new IntWritable(id));
}
}
public static class Reduce extends Reducer<Text, IntWritable,
WritableComparable, HCatRecord> {
protected void reduce( Text key,
java.lang.Iterable<IntWritable> values,
org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
WritableComparable, HCatRecord>.Context context)
throws IOException, InterruptedException {
// Only expecting one ID per group name
Iterator<IntWritable> iter = values.iterator();
IntWritable iw = iter.next();
int id = iw.get();
// Emit the group name and ID as a record
HCatRecord record = new DefaultHCatRecord(2);
record.set(0, key.toString());
record.set(1, id);
context.write(null, record);
}
}
@SuppressWarnings("deprecation")
public int run(String[] args) throws Exception {
Configuration conf = getConf();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
// Get the input and output table names as arguments
String inputTableName = args[0];
String outputTableName = args[1];
// Assume the default database
String dbName = "hadooppracticedb";
Job job = new Job(conf, "UseHCat");
HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
inputTableName, null));
job.setJarByClass(UseHCat.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits a string as key and an integer as value
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Ignore the key for the reducer output; emitting an HCatalog record as value
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName,
outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job);
System.err.println("INFO: output schema explicitly set for writing:" + s);
HCatOutputFormat.setSchema(job, s);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new UseHCat(), args);
System.exit(exitCode);
}
}
Getting the exception at this line:
HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
inputTableName, null));
hadoop jar command:
hadoop jar Hcat.jar com.otsi.hcat.UseHCat -files $HCATJAR -libjars ${LIBJARS} groups groupids
I have set the following property in hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
I have created the two tables, groups and groupids, in "hadooppracticedb".
Please suggest.
You haven't defined the metastore local property?
You should only set the hive.metastore.uris property if you are running a standalone MetaStore server, in which case you need to set hive.metastore.local=false and set hive.metastore.uris to a Thrift URI.
Please see this document for more details:
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
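In other words, for a standalone metastore the hive-site.xml would contain something along these lines (the host and port must match where your Thrift metastore actually listens; this is only a sketch):

<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>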
I have a script which is loading some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create a UDF which has a constructor that accepts the venues relation.
So I tried to define this UDF like this:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
And here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
private static final String ALLCHARS = "(.*)";
private ArrayList<String> venues;
private String regex;
public GenerateVenues(DataBag venuesBag) {
Iterator<Tuple> it = venuesBag.iterator();
venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
String current = "";
regex = "";
while (it.hasNext()){
Tuple t = it.next();
try {
current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
venues.add((String) t.get(0));
} catch (ExecException e) {
throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
}
regex += current + (it.hasNext() ? "|" : "");
}
}
@Override
public Tuple exec(Tuple tuple) throws IOException {
// expect one string
if (tuple == null || tuple.size() != 2) {
throw new IllegalArgumentException(
"BagTupleExampleUDF: requires two input parameters.");
}
try {
String tweet = (String) tuple.get(0);
for (String venue: venues)
{
if (tweet.matches(ALLCHARS + venue + ALLCHARS))
{
Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
return output;
}
}
return null;
} catch (Exception e) {
throw new IOException(
"BagTupleExampleUDF: caught exception processing input.", e);
}
}
}
When executed, the script fires an error at the DEFINE part, just before (venues);:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong; can you help me figure out what?
Is it that the UDF cannot accept the venues relation as a parameter? Or is the relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.
As @WinnieNicklaus already said, you can only pass strings to UDF constructors.
Having said that, the solution to your problem is the distributed cache: you need to override public List<String> getCacheFiles() to return a list of filenames that will be made available via the distributed cache. With that, you can read the file as a local file and build your table.
The downside is that Pig has no initialization function, so you have to implement something like
private void init() {
if (!this.initialized) {
// read table
}
}
and then call that as the first thing from exec.
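A rough sketch of that pattern for this UDF (the HDFS path, symlink name, and file layout are placeholders; adjust them to wherever the venues file actually lives):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class GenerateVenues extends EvalFunc<Tuple> {
    private List<String> venues;

    @Override
    public List<String> getCacheFiles() {
        // "hdfs_path#symlink": Pig ships the file through the distributed cache
        // and exposes it in the task's working directory under the symlink name.
        return Collections.singletonList("/data/venues_extended_2.csv#venues.csv");
    }

    private void init() throws IOException {
        if (venues != null) {
            return;
        }
        venues = new ArrayList<String>();
        // Read the cached copy as a plain local file.
        BufferedReader reader = new BufferedReader(new FileReader("./venues.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            venues.add(line.split(",")[0]); // assume the venue name is the first column
        }
        reader.close();
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        init(); // lazy initialization, since EvalFunc has no setup hook
        String tweet = input.get(0).toString();
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return TupleFactory.getInstance().newTuple(venue);
            }
        }
        return null;
    }
}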
You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.
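So the DEFINE has to pass a quoted string (for example, a file path) rather than the relation name; purely as an illustration:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues('/data/venues_extended_2.csv');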
Is there a function in Apache Pig that's similar to the LEAD/LAG functions in SQL? Or any Pig function that can look back at the previous row of a record?
Yes, there is pre-defined functionality. See the Over() and Stitch() methods in Piggybank. Over() has examples listed in the documentation.
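A rough sketch of the Over()/Stitch() pattern, modelled on the example in the Over() javadoc (the relation and field names are invented; check the Piggybank documentation for the exact argument forms):

REGISTER piggybank.jar;
DEFINE Over org.apache.pig.piggybank.evaluation.Over();
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();
A = LOAD 'input' AS (id:chararray, ts:long, val:chararray);
B = GROUP A BY id;
C = FOREACH B {
    sorted = ORDER A BY ts;
    GENERATE FLATTEN(Stitch(sorted, Over(sorted.val, 'lag')));
};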
Here is an alternative:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;
public class GenericLag2 extends EvalFunc<Tuple>{
private List<String> lagObjects = null;
@Override
public Tuple exec(Tuple input) throws IOException {
if (lagObjects == null) {
lagObjects = new ArrayList<String>();
return null;
}
try {
Tuple output = TupleFactory.getInstance().newTuple(lagObjects.size());
for (int i = 0; i < lagObjects.size(); i++) {
output.set(i, lagObjects.get(i));
}
lagObjects.clear();
for (int i = 0; i < input.size(); i++) {
lagObjects.add(input.get(i).toString());
}
return output;
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
@Override
public Schema outputSchema(Schema input) {
Schema tupleSchema = new Schema();
try {
for (int i = 0; i < input.size(); i++) {
tupleSchema.add(new FieldSchema("lag_" + i, DataType.CHARARRAY));
}
return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.TUPLE));
} catch (FrontendException e) {
e.printStackTrace();
return null;
}
}
}
I assume this would be faster, but I'm not sure, as you would have to do the following:
...
C = ORDER A BY important_order_by_field, second_important_order_by_field;
D = FOREACH C GENERATE
important_order_by_field
,second_important_order_by_field
,...
,FLATTEN(LAG(
string_field_to_lag
,int_field_to_lag
,date_field_to_lag
))
;
E = FOREACH D GENERATE
important_order_by_field
,second_important_order_by_field
,...
,string_field_to_lag
,(int) int_field_to_lag
,(date_field_to_lag IS NULL ?
null :
ToDate(SUBSTRING(REPLACE(date_field_to_lag, 'T', ' '), 0, 19), 'yyyy-MM-dd HH:mm:ss'))
as date_field_to_lag
;
DUMP E;
Ok here is my first shot at this. Mind you, I just started learning how to code UDFs today.
Maven's pom.xml file contains:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.0.0-cdh4.1.0</version>
</dependency>
...
Java UDF Class:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class GenericLag extends EvalFunc<String>{
private String lagObject = null;
@Override
public String exec(Tuple input) throws IOException {
try {
String returnObject = getLagObject();
setLagObject(input.get(0).toString());
return returnObject;
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
public String getLagObject() {
return lagObject;
}
public void setLagObject(String lagObject) {
this.lagObject = lagObject;
}
}
Initially, I had used Object instead of String everywhere that you see "String" above, but I received this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
I had to use setLagObject(input.get(0).toString()); instead of setLagObject(input.get(0)); or I would have received errors like:
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
java.lang.ClassCastException: org.joda.time.DateTime cannot be cast to java.lang.String
Here is how I use it in Pig:
REGISTER /path/to/compiled/file.jar
DEFINE LAG fully.qualified.domain.name.GenericLag();
A = LOAD '/hdfs/path/to/directory' USING PigStorage(',') AS (
important_order_by_field:int
,second_important_order_by_field:chararray
,...
,string_field_to_lag:chararray
,int_field_to_lag:int
,date_field_to_lag:chararray
);
B = FOREACH A GENERATE
important_order_by_field
,second_important_order_by_field
,...
,string_field_to_lag
,int_field_to_lag
,ToDate(date_field_to_lag, 'yyyy-MM-dd HH:mm:ss')
;
C = ORDER A BY important_order_by_field, second_important_order_by_field;
D = FOREACH C GENERATE
important_order_by_field
,second_important_order_by_field
,...
,LAG(string_field_to_lag) AS lag_string
,(int) LAG(int_field_to_lag) AS lag_int
,(date_field_to_lag IS NULL ?
null :
ToDate(SUBSTRING(REPLACE(LAG(date_field_to_lag), 'T', ' ') ,0,19), 'yyyy-MM-dd HH:mm:ss')) AS lag_date
;
DUMP D;
If I did the last line like this:
ToDate(SUBSTRING(REPLACE(LAG(date_field_to_lag), 'T', ' ') ,0,19), 'yyyy-MM-dd HH:mm:ss') AS lag_date
It would return the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias LAGGED_RHODES. Backend error : null
which, on checking the logs, reveals:
java.lang.NullPointerException
at org.joda.time.format.DateTimeFormatterBuilder$NumberFormatter.parseInto(DateTimeFormatterBuilder.java:1200)
because the first row will contain a null value.