Pig - passing Databag to UDF constructor

Pig - passing Databag to UDF constructor - apache-pig

I have a script which is loading some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create UDF which has a constructor that is accepting venues type.
So I tried to define this UDF like that:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
And here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
private static final String ALLCHARS = "(.*)";
private ArrayList<String> venues;
private String regex;
public GenerateVenues(DataBag venuesBag) {
Iterator<Tuple> it = venuesBag.iterator();
venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
String current = "";
regex = "";
while (it.hasNext()){
Tuple t = it.next();
try {
current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
venues.add((String) t.get(0));
} catch (ExecException e) {
throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
}
regex += current + (it.hasNext() ? "|" : "");
}
}
#Override
public Tuple exec(Tuple tuple) throws IOException {
// expect one string
if (tuple == null || tuple.size() != 2) {
throw new IllegalArgumentException(
"BagTupleExampleUDF: requires two input parameters.");
}
try {
String tweet = (String) tuple.get(0);
for (String venue: venues)
{
if (tweet.matches(ALLCHARS + venue + ALLCHARS))
{
Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
return output;
}
}
return null;
} catch (Exception e) {
throw new IOException(
"BagTupleExampleUDF: caught exception processing input.", e);
}
}
}
When executed the script is firing error at the DEFINE part just before (venues);:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong, can you help me out figuring out what's wrong.
Is it the UDF that cannot accept the venues relation as a parameter. Or the relation is not represented by DataBag like this public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.

As #WinnieNicklaus already said, you can only pass strings to UDF constructors.
Having said that, the solution to your problem is using distributed cache, you need to override public List<String> getCacheFiles() to return a list of filenames that will be made available via distributed cache. With that, you can read the file as a local file and build your table.
The downside is that Pig has no initialization function, so you have to implement something like
private void init() {
if (!this.initialized) {
// read table
}
}
and then call that as the first thing from exec.

You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.

Related

Find invalid values from JsonReaderException in Json.Net

I'm running the code below to purposely throw JsonReaderException. It correctly gives the exception message of "Could not convert string to boolean: aaa. Path 'Active', line 3, position 17."
Is there any way to get the value that has failed the validation directly from the JsonReaderException so I don't have to parse the exception message?
string json = #"{
'Email': 'james#example.com',
'Active': 'aaa',
'CreatedDate': '2013-01-20T00:00:00Z',
'Roles': [
'User',
'Admin'
]
}";
try
{
Account account = JsonConvert.DeserializeObject<Account>(json);
Console.WriteLine(account.Email);
}
catch (JsonReaderException exc)
{
// Do Something
}

It appears that the offending value is not saved as a property in JsonReaderException. The only possible location for this value would be the Exception.Data dictionary, however Json.NET does not add anything here.
However, with some work you can leverage Json.NET's serialization error event handling functionality to directly access the bad value at the time the exception is thrown. First, define the following helper method and ErrorEventArgs subtype:
public class ErrorAndValueEventArgs : Newtonsoft.Json.Serialization.ErrorEventArgs
{
public object ReaderValue { get; } = null;
public ErrorAndValueEventArgs(object readerValue, object currentObject, ErrorContext errorContext) : base(currentObject, errorContext)
{
this.ReaderValue = readerValue;
}
}
public static partial class JsonExtensions
{
public static TRootObject Deserialize<TRootObject>(string json, EventHandler<ErrorAndValueEventArgs> error, JsonSerializerSettings settings = null)
{
using (var sr = new StringReader(json))
using (var jsonReader = new JsonTextReader(sr))
{
var serializer = JsonSerializer.CreateDefault(settings);
serializer.Error += (o, e) => error(o, new ErrorAndValueEventArgs(jsonReader.Value, e.CurrentObject, e.ErrorContext));
return serializer.Deserialize<TRootObject>(jsonReader);
}
}
}
Now you will be able to access the value of JsonReader.Value at the time the exception was thrown:
object errorValue = null;
try
{
Account account = JsonExtensions.Deserialize<Account>(json, (o, e) => errorValue = e.ReaderValue);
Console.WriteLine(account.Email);
}
catch (JsonException exc)
{
// Do Something
Console.WriteLine("Value at time of {0} = {1}, Data.Count = {2}.", exc.GetType().Name, errorValue, exc.Data.Count);
// Prints Value at time of JsonReaderException = aaa, Data.Count = 0.
}
Notes:
Since you must manually create your own JsonTextReader, you will need to have access to the JSON string (or Stream) for this approach to work. (This is true in the example shown in your question.)
A similar technique for capturing additional error information is shown in JsonSerializationException Parsing.
You might want to enhance ErrorAndValueEventArgs to also record JsonReader.TokenType. In cases where the reader is positioned at the beginning of a container (object or array) at the time an exception is thrown, JsonReader.Value will be null.
Demo fiddle here.

Passing custom parameters to a pig udf function in java

This is the way I am looking to process my data.. from pig..
A = Load 'data' ...
B = FOREACH A GENERATE my.udfs.extract(*);
or
B = FOREACH A GENERATE my.udfs.extract('flag');
So basically extract either has no arguments or takes an argument... 'flag'
On my udf side...
#Override
public DataBag exec(Tuple input) throws IOException {
//if flag == true
//do this
//else
// do that
}
Now how do i implement this in pig?

The preferred way is to use DEFINE.
,,Use DEFINE to specify a UDF function when:
...
The constructor for the
function takes string parameters. If you need to use different
constructor parameters for different calls to the function you will
need to create multiple defines – one for each parameter set"
E.g:
Given the following UDF:
public class Extract extends EvalFunc<String> {
private boolean flag;
public Extract(String flag) {
//Note that a boolean param cannot be passed from script/grunt
//therefore pass it as a string
this.flag = Boolean.valueOf(flag);
}
public Extract() {
}
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
if (flag) {
...
}
else {
...
}
}
catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}
}
Then
define ex_arg my.udfs.Extract('true');
define ex my.udfs.Extract();
...
B = foreach A generate ex_arg(); --calls extract with flag set to true
C = foreach A generate ex(); --calls extract without any flag set
Another option (hack?) :
In this case the UDF gets instantiated with its noarg constructor and you pass the flag you want to evaluate in its exec method. Since this method takes a tuple as a parameter you need to first check whether the first field is the boolean flag.
public class Extract extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
boolean flag = false;
if (input.getType(0) == DataType.BOOLEAN) {
flag = (Boolean) input.get(0);
}
//process rest of the fields in the tuple
if (flag) {
...
}
else {
...
}
}
catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}
}
Then
...
B = foreach A generate Extract2(true,*); --use flag
C = foreach A generate Extract2();
I'd rather stick to the first solution as this smells.

Store and retrieve string arrays in HBase

I've read this answer (How to store complex objects into hadoop Hbase?) regarding the storing of string arrays with HBase.
There it is said to use the ArrayWritable Class to serialize the array. With WritableUtils.toByteArray(Writable ... writable) I'll get a byte[] which I can store in HBase.
When I now try to retrieve the rows again, I get a byte[] which I have somehow to transform back again into an ArrayWritable.
But I don't find a way to do this. Maybe you know an answer or am I doing fundamentally wrong serializing my String[]?

You may apply the following method to get back the ArrayWritable (taken from my earlier answer, see here) .
public static <T extends Writable> T asWritable(byte[] bytes, Class<T> clazz)
throws IOException {
T result = null;
DataInputStream dataIn = null;
try {
result = clazz.newInstance();
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
dataIn = new DataInputStream(in);
result.readFields(dataIn);
}
catch (InstantiationException e) {
// should not happen
assert false;
}
catch (IllegalAccessException e) {
// should not happen
assert false;
}
finally {
IOUtils.closeQuietly(dataIn);
}
return result;
}
This method just deserializes the byte array to the correct object type, based on the provided class type token.
E.g:
Let's assume you have a custom ArrayWritable:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
}
Now you issue a single HBase get:
...
Get get = new Get(row);
Result result = htable.get(get);
byte[] value = result.getValue(family, qualifier);
TextArrayWritable tawReturned = asWritable(value, TextArrayWritable.class);
Text[] texts = (Text[]) tawReturned.toArray();
for (Text t : texts) {
System.out.print(t + " ");
}
...
Note:
You may have already found the readCompressedStringArray() and writeCompressedStringArray() methods in WritableUtils
which seem to be suitable if you have your own String array-backed Writable class.
Before using them, I'd warn you that these can cause serious performance hit due to
the overhead caused by the gzip compression/decompression.

Mono.CSharp: how do I inject a value/entity into a script?

Just came across the latest build of Mono.CSharp and love the promise it offers.
Was able to get the following all worked out:
namespace XAct.Spikes.Duo
{
class Program
{
static void Main(string[] args)
{
CompilerSettings compilerSettings = new CompilerSettings();
compilerSettings.LoadDefaultReferences = true;
Report report = new Report(new Mono.CSharp.ConsoleReportPrinter());
Mono.CSharp.Evaluator e;
e= new Evaluator(compilerSettings, report);
//IMPORTANT:This has to be put before you include references to any assemblies
//our you;ll get a stream of errors:
e.Run("using System;");
//IMPORTANT:You have to reference the assemblies your code references...
//...including this one:
e.Run("using XAct.Spikes.Duo;");
//Go crazy -- although that takes time:
//foreach (Assembly assembly in AppDomain.CurrentDomain.GetAssemblies())
//{
// e.ReferenceAssembly(assembly);
//}
//More appropriate in most cases:
e.ReferenceAssembly((typeof(A).Assembly));
//Exception due to no semicolon
//e.Run("var a = 1+3");
//Doesn't set anything:
//e.Run("a = 1+3;");
//Works:
//e.ReferenceAssembly(typeof(A).Assembly);
e.Run("var a = 1+3;");
e.Run("A x = new A{Name=\"Joe\"};");
var a = e.Evaluate("a;");
var x = e.Evaluate("x;");
//Not extremely useful:
string check = e.GetVars();
//Note that you have to type it:
Console.WriteLine(((A) x).Name);
e = new Evaluator(compilerSettings, report);
var b = e.Evaluate("a;");
}
}
public class A
{
public string Name { get; set; }
}
}
And that was fun...can create a variable in the script's scope, and export the value.
There's just one last thing to figure out... how can I get a value in (eg, a domain entity that I want to apply a Rule script on), without using a static (am thinking of using this in a web app)?
I've seen the use compiled delegates -- but that was for the previous version of Mono.CSharp, and it doesn't seem to work any longer.
Anybody have a suggestion on how to do this with the current version?
Thanks very much.
References:
* Injecting a variable into the Mono.CSharp.Evaluator (runtime compiling a LINQ query from string)
* http://naveensrinivasan.com/tag/mono/

I know it's almost 9 years later, but I think I found a viable solution to inject local variables. It is using a static variable but can still be used by multiple evaluators without collision.
You can use a static Dictionary<string, object> which holds the reference to be injected. Let's say we are doing all this from within our class CsharpConsole:
public class CsharpConsole {
public static Dictionary<string, object> InjectionRepository {get; set; } = new Dictionary<string, object>();
}
The idea is to temporarily place the value in there with a GUID as key so there won't be any conflict between multiple evaluator instances. To inject do this:
public void InjectLocal(string name, object value, string type=null) {
var id = Guid.NewGuid().ToString();
InjectionRepository[id] = value;
type = type ?? value.GetType().FullName;
// note for generic or nested types value.GetType().FullName won't return a compilable type string, so you have to set the type parameter manually
var success = _evaluator.Run($"var {name} = ({type})MyNamespace.CsharpConsole.InjectionRepository[\"{id}\"];");
// clean it up to avoid memory leak
InjectionRepository.Remove(id);
}
Also for accessing local variables there is a workaround using Reflection so you can have a nice [] accessor with get and set:
public object this[string variable]
{
get
{
FieldInfo fieldInfo = typeof(Evaluator).GetField("fields", BindingFlags.NonPublic | BindingFlags.Instance);
if (fieldInfo != null)
{
var fields = fieldInfo.GetValue(_evaluator) as Dictionary<string, Tuple<FieldSpec, FieldInfo>>;
if (fields != null)
{
if (fields.TryGetValue(variable, out var tuple) && tuple != null)
{
var value = tuple.Item2.GetValue(_evaluator);
return value;
}
}
}
return null;
}
set
{
InjectLocal(variable, value);
}
}
Using this trick, you can even inject delegates and functions that your evaluated code can call from within the script. For instance, I inject a print function which my code can call to ouput something to the gui console window:
public delegate void PrintFunc(params object[] o);
public void puts(params object[] o)
{
// call the OnPrint event to redirect the output to gui console
if (OnPrint!=null)
OnPrint(string.Join("", o.Select(x => (x ?? "null").ToString() + "\n").ToArray()));
}
This puts function can now be easily injected like this:
InjectLocal("puts", (PrintFunc)puts, "CsInterpreter2.PrintFunc");
And just be called from within your scripts:
puts(new object[] { "hello", "world!" });
Note, there is also a native function print but it directly writes to STDOUT and redirecting individual output from multiple console windows is not possible.

how to parse non-string values in Opencsv HeaderColumnNameMappingStrategy

I'm using a HeaderColumnNameMappingStrategy to map a csv file with a header into a JavaBean. String values parse fine but any "true" or "false" value in csv doesn't map to JavaBean and I get the following exception from the PropertyDescriptor:
java.lang.IllegalArgumentException: argument type mismatch
The code where it occurs is in CsvToBean, line 64:
protected T processLine(MappingStrategy<T> mapper, String[] line) throws
IllegalAccessException, InvocationTargetException, InstantiationException, IntrospectionException {
T bean = mapper.createBean();
for(int col = 0; col < line.length; col++) {
String value = line[col];
PropertyDescriptor prop = mapper.findDescriptor(col);
if (null != prop) {
Object obj = convertValue(value, prop);
// this is where exception is thrown for a "true" value in csv
prop.getWriteMethod().invoke(bean, new Object[] {obj});
}
}
return bean;
}
protected PropertyEditor getPropertyEditor(PropertyDescriptor desc) throws
InstantiationException, IllegalAccessException {
Class<?> cls = desc.getPropertyEditorClass();
if (null != cls) return (PropertyEditor) cls.newInstance();
return getPropertyEditorValue(desc.getPropertyType());
}
I can confirm (via debugger) that the setter method id correctly retrieved at this point.
The problem occurs in desc.getPropertyEditorClass() since it returns null. I assumed primitive types and its wrappers are supported. Are they not?

I've run into this same issue. The cleanest way is probably to override getPropertyEditor like pritam did above and return a custom PropertyEditor for your particular object. The quick and dirty way would be to override convertValue in anonymous class form, like this:
CsvToBean<MyClass> csvToBean = new CsvToBean<MyClass>(){
#Override
protected Object convertValue(String value, PropertyDescriptor prop) throws InstantiationException,IllegalAccessException {
if (prop.getName().equals("myWhatever")) {
// return an custom object based on the incoming value
return new MyWhatever((String)value);
}
return super.convertValue(value, prop);
}
};
This is working fine for me with OpenCSV 2.3. Good luck!

I resolved this by extending CsvToBean and adding my own PropertyEditors. Turns out opencsv just supports primitive types and no wrappers.

Pritam's answer is great and this is a sample for dealing with datetime format.
PropertyEditorManager.registerEditor(java.util.Date.class, DateEditor.class);
You should write your own editor class like this:
public class DateEditor extends PropertyEditorSupport{
public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
#Override
public void setAsText(String text){
setValue(sdf.parse(text));}
}

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pig - passing Databag to UDF constructor - apache-pig

You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.

Related

Find invalid values from JsonReaderException in Json.Net

Passing custom parameters to a pig udf function in java

Store and retrieve string arrays in HBase

Mono.CSharp: how do I inject a value/entity into a script?

how to parse non-string values in Opencsv HeaderColumnNameMappingStrategy

Categories

Resources

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pig - passing Databag to UDF constructor - apache-pig

You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.

Related

Find invalid values from JsonReaderException in Json.Net

Passing custom parameters to a pig udf function in java

Store and retrieve string arrays in HBase

Mono.CSharp: how do I inject a value/entity *into* a script?

how to parse non-string values in Opencsv HeaderColumnNameMappingStrategy

Categories

Resources

Mono.CSharp: how do I inject a value/entity into a script?