Is there a way to dynamically compute the input value to a LOAD statement in Pig? Conceptually, I want to do something like this:
%declare MYINPUT com.foo.myMethod('2013-04-15');
raw = LOAD '$MYINPUT' ...
myMethod() is a UDF that accepts a date as input and returns a (comma-separated) list of directories as a string. That string is then given as the input to the LOAD statement.
Thanks.
It doesn't sound to me like myMethod() needs to be a UDF. Assuming the list of directories doesn't need to be computed in MapReduce, you can run the function first to get the string and then pass it to Pig as a parameter. A sample, if your driver is in Java, is provided below:
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
String myInput = myMethod("2013-04-15");
PigServer pig = new PigServer(ExecType.MAPREDUCE);
Map<String, String> myProperties = new HashMap<String, String>();
myProperties.put("myInput", myInput);
// pass the parameter map so that $myInput is substituted in the script
pig.registerScript("myScriptLocation.pig", myProperties);
and then your script would start with
raw = LOAD '$myInput' USING...
This assumes your myInput string is in a glob format PigStorage can read, or that you have a different LoadFunc in mind that can handle the comma-separated string.
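For illustration, myMethod() itself could be as simple as assembling that comma-separated list. A minimal sketch, assuming a date-partitioned directory layout (the /data/... paths are made up):

// Hypothetical helper: turn a date into the comma-separated directory
// list that LOAD will expand. The directory layout here is invented.
public static String myMethod(String date) {
    return String.join(",",
            "/data/us/" + date,
            "/data/eu/" + date);
}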
I had a similar issue and opted for a Java LoadFunc implementation instead of a pre-processor. Using a custom LoadFunc means the script can still be run by analysts using the stock Pig executable, and it doesn't require another dependency.
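That approach could look something like the following untested sketch: extend PigStorage and expand the date handed to LOAD into the actual directories before Pig resolves the input paths (the class name and directory layout are made up):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.pig.builtin.PigStorage;

public class DateExpandingLoader extends PigStorage {
    @Override
    public String relativeToAbsolutePath(String location, Path curDir) throws IOException {
        // location is the string handed to LOAD, e.g. '2013-04-15'
        String dirs = "/data/us/" + location + ",/data/eu/" + location;
        // LoadFunc's implementation already understands comma-separated paths
        return super.relativeToAbsolutePath(dirs, curDir);
    }
}

The script would then read: raw = LOAD '2013-04-15' USING com.foo.DateExpandingLoader();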
I'm currently running some performance tests and am having issues converting a string I have extracted from JSON into an int.
The problem I'm having is that I need this number both as an int and as a string. It is currently only a string, and I don't see how I can create another variable where the number is an int.
Here is the JSON extractor I'm using
How can I have another variable which is an int?
By default JMeter stores values into JMeter Variables in String form. If you need to save the value in Integer form as well, you can do it using e.g. the __groovy() function like:
${__groovy(vars.putObject('Fixture_ID_INT'\, vars.get('Fixture_ID') as int),)}
and access it where required like:
${__groovy(vars.getObject('Fixture_ID_INT'),)}
More information: Apache Groovy - Why and How You Should Use It
You can use the code below, where urs is the name of the JMeter Variable that holds the extracted string:
Integer.parseInt(vars.get("urs"));
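Putting the two together, a minimal JSR223 sketch in plain Java syntax (also valid as Groovy); the variable names Fixture_ID and Fixture_ID_INT follow the question and the answer above:

String raw = vars.get("Fixture_ID");        // String form from the JSON extractor
int asInt = Integer.parseInt(raw);          // parse the numeric value
vars.putObject("Fixture_ID_INT", asInt);    // stored (autoboxed) as an Integer
// later, wherever the int is needed:
int n = (Integer) vars.getObject("Fixture_ID_INT");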
I have a PCollection<String> and I want to transform it to get the values of a specific column from a BigQuery table. So I used BigQueryIO.readTableRows to get the values from BigQuery.
Here is my Code:
PCollection<TableRow> getConfigTable = pipeline.apply("read from Table",
        BigQueryIO.readTableRows().from("TableName"));
RetrieveDestTableName retrieveDestTableName = new RetrieveDestTableName();
// the DoFn consumes TableRow elements, so the type parameters are <TableRow, String>
PCollection<String> getDestTableName = getConfigTable.apply(ParDo.of(new DoFn<TableRow, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().get("ColumnName").toString());
    }
}));
As per the above code, getDestTableName is of type PCollection<String>, but I want this output in a String variable.
Is there any way to convert a PCollection<String> to a String so that I can use it as a normal variable in my code?
Converting a PCollection<String> to a String is not possible in the Apache Beam programming model. A PCollection is a deferred description of the data at a given point in the pipeline; during pipeline construction you do not have literal access to the strings it will contain.
You can process the strings in a PCollection through transforms. However, it seems like you need the table configuration to construct the rest of the pipeline. You'll need to know the destination ahead of time, or you can use DynamicDestinations to determine which table to write to during pipeline execution (a sketch follows). You cannot get the table configuration value out of the PCollection and use it to further construct the pipeline.
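For the write side, a hedged sketch of the DynamicDestinations route (the DestTable field, the project/dataset names, and the schema are illustrative, not taken from the question):

import java.util.Arrays;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

rows.apply(BigQueryIO.writeTableRows().to(
    new DynamicDestinations<TableRow, String>() {
        @Override
        public String getDestination(ValueInSingleWindow<TableRow> element) {
            // choose the destination per element, at execution time
            return (String) element.getValue().get("DestTable");
        }
        @Override
        public TableDestination getTable(String tableName) {
            return new TableDestination("my-project:my_dataset." + tableName, null);
        }
        @Override
        public TableSchema getSchema(String tableName) {
            return new TableSchema().setFields(Arrays.asList(
                new TableFieldSchema().setName("DestTable").setType("STRING")));
        }
    }));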
It seems that you want something like JdbcIO.readAll() but for BigQuery, allowing the read configuration(s) to be dynamically computed by the pipeline. This is currently not implemented for BigQuery, but it'd be a reasonable request.
Meanwhile your options are:
Express what you're doing as a more complex BigQuery SQL query, and use a single BigQueryIO.read().fromQuery().
Express the part of your pipeline where you extract the table of interest without the Beam API, using the BigQuery client API directly, so you are operating on regular Java variables instead of PCollections (see the sketch after this list).
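The second option could look like this minimal sketch using the google-cloud-bigquery client (project, dataset, table, and column names are placeholders; query() also throws InterruptedException, which is omitted here):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableResult result = bigquery.query(QueryJobConfiguration.of(
        "SELECT DestTable FROM `my-project.my_dataset.ConfigTable` LIMIT 1"));
String destTable = result.iterateAll().iterator().next()
        .get("DestTable").getStringValue();
// destTable is now an ordinary String, usable while constructing the pipeline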
I am trying to process a dataset with JSON data. However, the data have been written on a file without being parsed. That means that a python dictionary is written in the file as a string instead of a JSON object as a string.
I've found that the ast module will do the job of converting the string back to a dictionary, using the ast.literal_eval function.
However, I am getting a very strange error on some of the instances. The code reads from a text file and applies the following to each line:
ast.literal_eval(line.rstrip())
It seems the ast module chokes on some of the characters.
Note as well that this is not happening across the whole dataset, just on some instances.
Any ideas?
Many thanks in advance.
Try exploring the json package. It is a cleaner and more standard way of converting strings to dictionaries:

import json

json.loads(inputStr)   # converts string -> dict
json.dumps(inputJson)  # converts dict -> string
Hope this helps. Cheers!
Let's say that in my Pig script I just want to generate a summary by calling a UDF once.
The UDF takes a map, formats it properly, and returns a String.
Is there any way of calling this UDF just once, instead of calling it like this:
report = FOREACH dummyTuple GENERATE myUDF(myMap);
One way of doing this is to generate dummyTuple as a relation limited to one row (with LIMIT 1) and then apply the statement above. A sketch of what such a UDF might look like follows.
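For reference, a minimal sketch of the kind of UDF described, formatting a map into a single String (the class name and the formatting logic are placeholders):

import java.io.IOException;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FormatMap extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        @SuppressWarnings("unchecked")
        Map<String, Object> m = (Map<String, Object>) input.get(0); // Pig maps have chararray keys
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : m.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}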
I have a populated list:
def someList = ... // string values
and I want to pass this into a SQL statement to restrict which columns the query selects.
db.rows("select ${someList} from arch_application")
However, I get this error when I try to do so:
There is a ? parameter in the select list. This is not allowed.
Anyone have any ideas? Thanks!
When you pass a GString to Sql.rows, it gets parsed differently than a normal Groovy string: Sql turns each ${} substitution into a replaceable parameter of a PreparedStatement. Since column names cannot be bound parameters, that is not what you want here. Try forcing the GString to a plain Java String:
db.rows("select ${someList.join(',')} from arch_application" as String)