Let's say that in my Pig script I just want to generate a summary by calling a UDF once.
The UDF takes a map, formats it internally, and returns a String.
Is there any way of calling this UDF just once, instead of calling it via
report = FOREACH dummyTuple GENERATE myUDF(myMap);
One way of doing this is to generate a dummy relation, limit it to one row, and then apply the statement above.
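A minimal sketch of that workaround (the dummy file name and the map literal are placeholders; any one-line input works):
dummy  = LOAD 'dummy.txt' AS (line:chararray);
one    = LIMIT dummy 1;                           -- keep exactly one row
report = FOREACH one GENERATE myUDF(['k'#'v']);   -- the UDF now runs once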
I have a UDF defined like so:
def my_function(input: Array[Byte])
and I want to call it in Spark SQL, so I'm trying
SELECT my_function(binary(CONCAT(*))) FROM table;
but I don't think this is working. To my understanding, SELECT * returns an Array[Row], and the native binary function then serializes that. Will that convert the Array[Row] to an Array[Byte]? I'm not sure how to call this UDF via SQL.
We have to register the function before we can use it as a UDF in SQL, i.e.
spark.udf.register("my_function", my_function _)
You can explore more in the Spark documentation.
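A hedged end-to-end sketch in Scala (the table name, the columns, and the body of my_function are assumptions; note the UDF receives one binary value per row, not an Array[Row] for the whole table):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()
// placeholder body; the real my_function comes from the question
def my_function(input: Array[Byte]): Int = input.length
spark.udf.register("my_function", my_function _)
// assumed table and columns; binary() casts the concatenated string to Array[Byte]
spark.sql("SELECT my_function(binary(concat(col1, col2))) FROM table").show()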
I have a PCollection<String> and I want to transform it to get the values of a specific column from a BigQuery table. So I used BigQueryIO.readTableRows() to get values from BigQuery.
Here is my code:
PCollection<TableRow> getConfigTable = pipeline.apply("read from Table",
    BigQueryIO.readTableRows().from("TableName"));
RetrieveDestTableName retrieveDestTableName = new RetrieveDestTableName();
PCollection<String> getDestTableName = getConfigTable.apply(ParDo.of(new DoFn<TableRow, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().get("ColumnName").toString());
    }
}));
As per the above code I get an output getDestTableName of type PCollection<String>, but I want this output in a String variable.
Is there any way to convert a PCollection<String> to a String variable so that I can use it in my code?
Converting a PCollection<String> to a String is not possible in the Apache Beam programming model. A PCollection simply describes the state of the pipeline at any given point. During development, you do not have literal access to the strings in the PCollection.
You can process the strings in a PCollection through transforms. However, it seems like you need the table configuration to construct the rest of the pipeline. You'll need to know the destination ahead of time or you can use DynamicDestinations to determine which table to write to during pipeline execution. You cannot get the table configuration value from the PCollection and use it to further construct the pipeline.
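If you go the DynamicDestinations route, a hedged sketch (the "dest" field name and the single-column schema are assumptions, not from your pipeline):
getConfigTable.apply(BigQueryIO.writeTableRows()
    .to(new DynamicDestinations<TableRow, String>() {
        @Override
        public String getDestination(ValueInSingleWindow<TableRow> element) {
            return (String) element.getValue().get("dest"); // assumed field
        }
        @Override
        public TableDestination getTable(String destination) {
            return new TableDestination(destination, "dynamically chosen table");
        }
        @Override
        public TableSchema getSchema(String destination) {
            // assumed schema; replace with your real columns
            return new TableSchema().setFields(Arrays.asList(
                new TableFieldSchema().setName("ColumnName").setType("STRING")));
        }
    }));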
It seems that you want something like JdbcIO.readAll() but for BigQuery, allowing the read configuration(s) to be dynamically computed by the pipeline. This is currently not implemented for BigQuery, but it'd be a reasonable request.
Meanwhile your options are:
Express what you're doing as a more complex BigQuery SQL query, and use a single BigQueryIO.read().fromQuery()
Express the part of your pipeline where you extract the table of interest without the Beam API, instead using the BigQuery client API directly, so you are operating on regular Java variables instead of PCollections (a sketch follows).
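A hedged sketch of that second option with the google-cloud-bigquery client (the query, dataset, and column name are placeholders; note that query() throws InterruptedException, which you must handle or declare):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// run the config lookup before the Beam pipeline is built,
// so the destination lands in a plain String
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableResult result = bigquery.query(QueryJobConfiguration.of(
    "SELECT ColumnName FROM dataset.ConfigTable LIMIT 1")); // assumed query
String destTableName = result.iterateAll().iterator().next()
    .get("ColumnName").getStringValue();
// destTableName can now be used while constructing the pipeline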
I need to process all the files in a folder with something like this:
foreach loop over n
fileprocess.exe -argument_n filename_n
Each argument is file-specific and will be retrieved from a table.
I need to design an SSIS package to batch process the files.
A Foreach Loop seems perfect for it.
I'm thinking of using a Lookup transform to retrieve the corresponding argument.
My questions are:
How do I feed the variable @[User::filename] to the Lookup transform?
How do I assign the lookup output to @[User::argument]?
Is the Lookup transform the right one to use?
Thanks a lot!
Assuming that you have a table containing columns for file name and corresponding argument, one way to implement your requirement is as below:
Add the following components to the Control Flow: a Foreach Loop container, with an Execute SQL task and an Execute Process task inside it.
In the Foreach Loop, Enumerator is set to Foreach File Enumerator. The files are read from a folder, but you could use any type of enumerator.
Create two variables in the scope of the loop to hold the file name and the argument, say fname and farg. In the Variable Mappings page of the Foreach Loop Editor, map the variable fname to index 0.
In the Execute SQL task, write a query to retrieve the argument for a given file name (a sketch of such a query follows these steps). Assign the variable fname as the input parameter, set Result Set to Single Row, and assign the result to the variable farg.
In the Execute Process task, pass the variables fname and farg as arguments for your executable.
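A sketch of the query inside the Execute SQL task, assuming a config table dbo.FileArguments with columns FileName and Argument (both names are placeholders); the ? is the parameter mapped to fname:
-- single-row result that gets assigned to farg
SELECT Argument
FROM dbo.FileArguments
WHERE FileName = ?;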
I realize the question is confusing. I'm trying to reference many widgets that were created in the main loop of my script from a secondary function using e.parameter.
Instead of referencing each e.parameter separately, by its name, I'd like to be able to make one reference to e.parameter and have the parameter name portion be a globally defined variable.
As in:
e.parameter.some_id
Would be the same as:
var test=[]
test[0]='some_id'
e.parameter.(test[0])
Or some other syntax. I'm trying to reference the parameters in a loop, and using the array means I can increment a for-loop counter instead of doing if tests for each parameter individually.
I'm certain there's an easier way to do this, but I'm still new to Java.
Use e.parameter[test[0]]. It is not Java but JavaScript: bracket notation lets you look up a property with a computed name.
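A minimal sketch of the loop, assuming the widget names are collected in an array (the names are placeholders):
var ids = ['some_id', 'another_id', 'third_id']; // hypothetical widget names
for (var i = 0; i < ids.length; i++) {
  var value = e.parameter[ids[i]]; // bracket notation with a computed key
  // ... use value ...
}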
Is there a way to dynamically compute the input value to a LOAD statement in Pig? Conceptually, I want to do something like this:
%declare MYINPUT com.foo.myMethod('2013-04-15');
raw = LOAD '$MYINPUT' ...
myMethod() is a UDF that accepts a date as input and returns a (comma-separated) list of directories as a string. That string is then given as the input to the LOAD statement.
Thanks.
It doesn't sound to me like myMethod() needs to be a UDF. Presuming the list of directories doesn't need to be computed in MapReduce, you could run the function to get the string first and then pass it to Pig as a property. A sample, if your driver is in Java, is provided below:
String myInput = myMethod("2013-04-15");
PigServer pig = new PigServer(ExecType.MAPREDUCE);
Map<String, String> myProperties = new HashMap<String, String>();
myProperties.put("myInput", myInput);
// pass the parameter map so '$myInput' is substituted in the script
pig.registerScript("myScriptLocation.pig", myProperties);
and then your script would start with
raw = LOAD '$myInput' USING...
This assumes your myInput string is in a glob format PigStorage can read, or that you have a different LoadFunc in mind that can handle your comma-separated string.
I had a similar issue and opted for a Java LoadFunc implementation instead of a pre-processor. Using a custom LoadFunc means the script can still be run by analysts using the stock pig executable, and doesn't require another dependency.
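A hedged sketch of that approach: extend PigStorage and override relativeToAbsolutePath() so a date passed as the load location is expanded into the real directory list (the class name and the MyPaths.myMethod helper are assumptions):
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.pig.builtin.PigStorage;

public class DatePartitionedStorage extends PigStorage {
    @Override
    public String relativeToAbsolutePath(String location, Path curDir) throws IOException {
        // expand the date into a comma-separated list of directories,
        // e.g. "2013-04-15" -> "/data/2013-04-15/a,/data/2013-04-15/b"
        return MyPaths.myMethod(location); // hypothetical helper
    }
}
The script then reads raw = LOAD '2013-04-15' USING com.foo.DatePartitionedStorage(); and still runs under the stock pig executable.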