Trouble creating a Pig UDF schema - apache-pig

I'm trying to parse XML and I'm having trouble with my UDF returning a tuple. I'm following the example from http://verboselogging.com/2010/03/31/writing-user-defined-functions-for-pig
Pig script:
titles = FOREACH programs GENERATE (px.pig.udf.PARSE_KEYWORDS(program))
AS (root_id:chararray, keyword:chararray);
Here is the output schema code:
override def outputSchema(input: Schema): Schema = {
  try {
    val s: Schema = new Schema
    s.add(new Schema.FieldSchema("root_id", DataType.CHARARRAY))
    s.add(new Schema.FieldSchema("keyword", DataType.CHARARRAY))
    return s
  } catch {
    case e: Exception => return null
  }
}
I'm getting this error:
pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException:
ERROR 0: Given UDF returns an improper Schema.
Schema should only contain one field of a Tuple, Bag, or a single type.
Returns: {root_id: chararray,keyword: chararray}
Update - final solution, in Java:
public Schema outputSchema(Schema input) {
    try {
        Schema tupleSchema = new Schema();
        tupleSchema.add(input.getField(1));
        tupleSchema.add(input.getField(0));
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.TUPLE));
    } catch (Exception e) {
        return null;
    }
}

You will need to wrap your s schema instance in another Schema object whose single field is a tuple.
Try returning new Schema(new FieldSchema(..., s, DataType.TUPLE)), like in the template below:
Here is my answer in Java (fill out your variable names):
@Override
public Schema outputSchema(Schema input) {
    Schema tupleSchema = new Schema();
    try {
        tupleSchema.add(new FieldSchema("root_id", DataType.CHARARRAY));
        tupleSchema.add(new FieldSchema("keyword", DataType.CHARARRAY));
        return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.TUPLE));
    } catch (FrontendException e) {
        e.printStackTrace();
        return null;
    }
}
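For contrast, here is a minimal sketch of why the two-field version is rejected: the schema handed back to Pig must contain exactly one top-level field (of type TUPLE in this case), with the two chararray fields nested inside it. The "parsed" alias is made up for illustration, and the fragment belongs inside the same try/catch as above, since this FieldSchema constructor can throw FrontendException.

// Rejected: two top-level fields, triggers "Given UDF returns an improper Schema"
Schema bad = new Schema();
bad.add(new Schema.FieldSchema("root_id", DataType.CHARARRAY));
bad.add(new Schema.FieldSchema("keyword", DataType.CHARARRAY));

// Accepted: one top-level TUPLE field wrapping the two fields
Schema inner = new Schema();
inner.add(new Schema.FieldSchema("root_id", DataType.CHARARRAY));
inner.add(new Schema.FieldSchema("keyword", DataType.CHARARRAY));
Schema good = new Schema(new Schema.FieldSchema("parsed", inner, DataType.TUPLE));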
Would you try:
titles = FOREACH programs GENERATE (px.pig.udf.PARSE_KEYWORDS(program));
If that doesn't error, then try:
titles = FOREACH titles GENERATE
    $0 AS root_id,
    $1 AS keyword;
And tell me the error?

Related

Find invalid values from JsonReaderException in Json.Net

I'm running the code below to purposely throw JsonReaderException. It correctly gives the exception message of "Could not convert string to boolean: aaa. Path 'Active', line 3, position 17."
Is there any way to get the value that has failed the validation directly from the JsonReaderException so I don't have to parse the exception message?
string json = @"{
    'Email': 'james@example.com',
    'Active': 'aaa',
    'CreatedDate': '2013-01-20T00:00:00Z',
    'Roles': [
        'User',
        'Admin'
    ]
}";
try
{
    Account account = JsonConvert.DeserializeObject<Account>(json);
    Console.WriteLine(account.Email);
}
catch (JsonReaderException exc)
{
    // Do Something
}
It appears that the offending value is not saved as a property in JsonReaderException. The only possible location for this value would be the Exception.Data dictionary, however Json.NET does not add anything here.
However, with some work you can leverage Json.NET's serialization error event handling functionality to directly access the bad value at the time the exception is thrown. First, define the following helper method and ErrorEventArgs subtype:
public class ErrorAndValueEventArgs : Newtonsoft.Json.Serialization.ErrorEventArgs
{
    public object ReaderValue { get; } = null;

    public ErrorAndValueEventArgs(object readerValue, object currentObject, ErrorContext errorContext) : base(currentObject, errorContext)
    {
        this.ReaderValue = readerValue;
    }
}
public static partial class JsonExtensions
{
    public static TRootObject Deserialize<TRootObject>(string json, EventHandler<ErrorAndValueEventArgs> error, JsonSerializerSettings settings = null)
    {
        using (var sr = new StringReader(json))
        using (var jsonReader = new JsonTextReader(sr))
        {
            var serializer = JsonSerializer.CreateDefault(settings);
            serializer.Error += (o, e) => error(o, new ErrorAndValueEventArgs(jsonReader.Value, e.CurrentObject, e.ErrorContext));
            return serializer.Deserialize<TRootObject>(jsonReader);
        }
    }
}
Now you will be able to access the value of JsonReader.Value at the time the exception was thrown:
object errorValue = null;
try
{
    Account account = JsonExtensions.Deserialize<Account>(json, (o, e) => errorValue = e.ReaderValue);
    Console.WriteLine(account.Email);
}
catch (JsonException exc)
{
    // Do Something
    Console.WriteLine("Value at time of {0} = {1}, Data.Count = {2}.", exc.GetType().Name, errorValue, exc.Data.Count);
    // Prints Value at time of JsonReaderException = aaa, Data.Count = 0.
}
Notes:
Since you must manually create your own JsonTextReader, you will need to have access to the JSON string (or Stream) for this approach to work. (This is true in the example shown in your question.)
A similar technique for capturing additional error information is shown in JsonSerializationException Parsing.
You might want to enhance ErrorAndValueEventArgs to also record JsonReader.TokenType. In cases where the reader is positioned at the beginning of a container (object or array) at the time an exception is thrown, JsonReader.Value will be null.

Checking if a table exists in BigQuery Java

I'm trying to write a function to check whether a table exists or not in BigQuery. The following code always returns true. Where is the problem?
Thanks!
private static boolean checkTableExist() {
    try {
        BigQueryOptions.Builder optionsBuilder = BigQueryOptions.newBuilder();
        BigQuery bigquery = optionsBuilder.build().getService();
        bigquery.getTable(options.getBigQueryDatasetId(), options.getBigQueryTableId());
    } catch (Exception e) {
        return false;
    }
    return true;
}
I don't think you should rely on a Java exception to test a boolean condition.
I haven't looked much at the getTable() method, but here is how I check whether a table exists:
public boolean isExisting() {
    return getDataset().get(tableName) != null;
}

protected Dataset getDataset() {
    return bigQuery.getDataset(dataSetName);
}
Try this (with a null check, since get(tableName) returns null for a missing table):
Table table = bigquery.getDataset(datasetName).get(tableName);
if (table != null && table.exists()) {
    // table exists
} else {
    // table does not exist in BQ dataset
}

Pig - passing Databag to UDF constructor

I have a script which is loading some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create a UDF whose constructor accepts the venues relation.
So I tried to define the UDF like this:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
And here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;
    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()) {
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException("BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue : venues) {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS)) {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException("BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}
When executed, the script fires an error at the DEFINE part, just before (venues);:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong; can you help me figure out what?
Is it that the UDF cannot accept the venues relation as a parameter, or that the relation is not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.
As @WinnieNicklaus already said, you can only pass strings to UDF constructors.
That said, the solution to your problem is to use the distributed cache: override public List<String> getCacheFiles() to return a list of filenames that will be made available via the distributed cache. With that, you can read the file as a local file and build your table.
The downside is that Pig has no initialization function, so you have to implement something like
private void init() {
    if (!this.initialized) {
        // read table
    }
}
and then call that as the first thing from exec.
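A rough sketch of that approach (untested; the HDFS path and the venues_cached.csv symlink name are made up for illustration):

public class GenerateVenues extends EvalFunc<Tuple> {
    private List<String> venues; // built lazily on the first exec() call

    @Override
    public List<String> getCacheFiles() {
        // Ships the HDFS file to every worker via the distributed cache;
        // the part after '#' is the local file name it will be visible under.
        return Collections.singletonList("/user/anton/venues_extended_2.csv#venues_cached.csv");
    }

    private void init() throws IOException {
        if (venues == null) {
            venues = new ArrayList<String>();
            BufferedReader reader = new BufferedReader(new FileReader("venues_cached.csv"));
            String line;
            while ((line = reader.readLine()) != null) {
                venues.add(line.split(",")[0]); // first CSV column holds the venue name
            }
            reader.close();
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        init(); // Pig has no setup hook, so initialize on the first call
        // ... same matching logic as in the constructor-based version above ...
        return null;
    }
}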
You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.

Passing custom parameters to a pig udf function in java

This is the way I am looking to process my data from Pig:
A = Load 'data' ...
B = FOREACH A GENERATE my.udfs.extract(*);
or
B = FOREACH A GENERATE my.udfs.extract('flag');
So basically extract either takes no arguments or takes one argument, 'flag'.
On my UDF side:
@Override
public DataBag exec(Tuple input) throws IOException {
    // if flag == true
    //     do this
    // else
    //     do that
}
Now how do I implement this in Pig?
The preferred way is to use DEFINE.
"Use DEFINE to specify a UDF function when:
...
The constructor for the function takes string parameters. If you need to use different constructor parameters for different calls to the function you will need to create multiple defines - one for each parameter set."
E.g., given the following UDF:
public class Extract extends EvalFunc<String> {
    private boolean flag;

    public Extract(String flag) {
        // Note that a boolean param cannot be passed from script/grunt,
        // therefore pass it as a string
        this.flag = Boolean.valueOf(flag);
    }

    public Extract() {
    }

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            if (flag) {
                ...
            }
            else {
                ...
            }
        }
        catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Then
define ex_arg my.udfs.Extract('true');
define ex my.udfs.Extract();
...
B = foreach A generate ex_arg(); --calls extract with flag set to true
C = foreach A generate ex(); --calls extract without any flag set
Another option (a hack?):
Have the UDF instantiated with its no-arg constructor and pass the flag you want to evaluate to its exec method. Since this method takes a tuple as a parameter, you need to first check whether the first field is the boolean flag.
public class Extract extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            boolean flag = false;
            if (input.getType(0) == DataType.BOOLEAN) {
                flag = (Boolean) input.get(0);
            }
            // process rest of the fields in the tuple
            if (flag) {
                ...
            }
            else {
                ...
            }
        }
        catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Then
...
B = foreach A generate Extract(true, *); --use flag
C = foreach A generate Extract();
I'd rather stick with the first solution, as this one smells like a hack.

SAP JCo RETURN Table empty when using TransactionID

I'm using the JCo library to access a SAP standard BAPI. Everything is working except that the RETURN table is always empty when I use the TID (TransactionID).
When I remove the TID, I get the RETURN table filled with warnings etc. But unfortunately I need to use the TID for the transactional BAPI, otherwise the changes are not committed.
Why is the RETURN TABLE empty when using TID?
Or how must I commit changes to a transactional BAPI?
Here is pseudo-code of a BAPI access:
import com.sap.conn.jco.*;
import org.apache.commons.logging.*;

public class BapiSample {
    private static final Log logger = LogFactory.getLog(BapiSample.class);
    private static final String CLIENT = "400";
    private static final String INSTITUTION = "1000";

    protected JCoDestination destination;

    public BapiSample() {
        this.destination = getDestination("mySAPConfig.properties");
    }

    public void execute() {
        String tid = null;
        try {
            tid = destination.createTID();
            JCoFunction function = destination.getRepository().getFunction("BAPI_PATCASE_CHANGEOUTPATVISIT");
            function.getImportParameterList().setValue("CLIENT", CLIENT);
            function.getImportParameterList().setValue("INSTITUTION", INSTITUTION);
            function.getImportParameterList().setValue("MOVEMNT_SEQNO", "0001");
            // Here we then set all parameters of the BAPI...
            // ...
            // Now the execute
            function.execute(destination, tid);
            // And getting the RETURN table. !!! THIS IS ALWAYS EMPTY!
            JCoTable returnTable = function.getTableParameterList().getTable("RETURN");
            int numRows = returnTable.getNumRows();
            for (int i = 0; i < numRows; i++) {
                returnTable.setRow(i);
                logger.info("RETURN VALUE: " + returnTable.getString("MESSAGE"));
            }
            JCoFunction commit = destination.getRepository().getFunction("BAPI_TRANSACTION_COMMIT");
            commit.execute(destination, tid);
            destination.confirmTID(tid);
        } catch (Throwable ex) {
            try {
                if (destination != null) {
                    JCoFunction rollback = destination.getRepository().getFunction("BAPI_TRANSACTION_ROLLBACK");
                    rollback.execute(destination, tid);
                }
            } catch (Throwable t1) {
            }
        }
    }

    protected static JCoDestination getDestination(String fileName) {
        JCoDestination result = null;
        try {
            result = JCoDestinationManager.getDestination(fileName);
        } catch (Exception ex) {
            logger.error("Error during destination resolution", ex);
        }
        return result;
    }
}
UPDATE 10.01.2013: I was finally able to get both the RETURN table filled and the inputs committed. The solution is to do both: a commit without TID, get the RETURN table, and then make another commit with TID.
Very, very strange, but maybe that is the correct usage of the JCo commits. Can someone explain this to me?
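Read literally, that workaround would look something like the sketch below (my interpretation of the update, not verified against a SAP system; note that the answer below warns against calling execute twice):

// First call without a TID: the RETURN table gets filled
function.execute(destination);
JCoTable returnTable = function.getTableParameterList().getTable("RETURN");
// ... inspect the messages as in the example above ...

// Second, transactional call with the TID, so the changes are committed
function.execute(destination, tid);
JCoFunction commit = destination.getRepository().getFunction("BAPI_TRANSACTION_COMMIT");
commit.execute(destination, tid);
destination.confirmTID(tid);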
You should not call the execute method twice; it will increment the sequence number.
Use the begin and end methods of the JCoContext class instead.
If you call the begin method at the beginning of the process, the data will be updated and the messages will be returned.
Here is the sample code.
JCoDestination destination = JCoDestinationManager.getDestination("");
try
{
    JCoContext.begin(destination);
    function.execute(destination);
    function.execute(destination);
}
catch (AbapException ex)
{
    ...
}
catch (JCoException ex)
{
    ...
}
catch (Exception ex)
{
    ...
}
finally
{
    JCoContext.end(destination);
}
You can refer to this URL for further information:
http://www.finereporthelp.com/download/SAP/sapjco3_linux_32bit/javadoc/com/sap/conn/jco/JCoContext.html