Dynamically set input and output paths in pig using UDFs - apache-pig

I would like to create a no-arg Pig script that dynamically creates its input and output paths.
The script itself should determine an input file glob based on the current date and similarly determine an output file path based on the current date. While I know that one can easily pass in parameters, I was hoping to have a no-arg script and use a couple of simple Jython UDFs to compute these paths.
How do I do that? I can't seem to set variables by calling a UDF. For instance,
%default OUTPUTPATH myfn();
or
path = myfn();
don't seem to work.
Any ideas?
(Why no-args? Because I would like to have a single static Amazon Data Pipeline config that runs the same script each day but, under the hood, processes the last day's or last week's worth of log files each time.)

Sadly, to my knowledge, there is no way to do this in pure Pig. However, you can define these changing variables in a Python wrapper. In your case, you'd just define the dict of args like:
d = {
    'OUTPATH': myfn(),
}
And then pass that dict like:
P = Pig.compileFromFile(path_to_my_script)   # compileFromFile takes a file path; Pig.compile takes a Pig Latin string
Q = P.bind(d)                                # substitutes $OUTPATH etc. with the dict values
results = Q.run()
Of course there is a little more to add to the wrapper, but it should be pretty clear from the docs.
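Putting this together, a minimal wrapper run with Pig's embedded Python (Jython) support might look like the sketch below. The bucket paths, script name, and the date helper are illustrative assumptions rather than anything from the original question:

from datetime import date, timedelta
from org.apache.pig.scripting import Pig

def last_day_paths():
    # Illustrative helper: yesterday's input glob and a dated output path.
    d = date.today() - timedelta(days=1)
    inpath = 's3://my-bucket/logs/%s/*' % d.strftime('%Y/%m/%d')
    outpath = 's3://my-bucket/reports/%s' % d.strftime('%Y%m%d')
    return inpath, outpath

inpath, outpath = last_day_paths()
P = Pig.compileFromFile('my_script.pig')   # the .pig script references $INPATH and $OUTPATH
Q = P.bind({'INPATH': inpath, 'OUTPATH': outpath})
results = Q.runSingle()
if not results.isSuccessful():
    raise RuntimeError('pig run failed')

The Data Pipeline activity then just invokes pig wrapper.py each day instead of a parameterized .pig script.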

Related

How to define a Combitimetable through a script in Dymola?

I am trying to perform several simulations in a sequence using a for loop in a script. From simulation to simulation, the only variable to change is the file path of a Combitimetable.
I propagated the variable fileName in order to assign a new path in each iteration. However, when the model reads the extension, it changes the timeScale and the resolution is lower than needed. I tried to propagate timeScale too, but without any change. Is there a function to define the Combitimetable variables? Or is my only alternative to merge all tables and split the results manually?
Example of the script on a single run (without the for loop):
filePath="RL_30_200g";
dymolaPath = "modelica://customTILComponents/Combitables/Combitimetable_"+filePath+".txt";
fileName= ModelicaServices.ExternalReferences.loadResource(dymolaPath);
result ="Full_Year_Simulation_"+filePath;
timeScale = 1/3600;
translateModel ("customTILComponents.MA_Santoro.FullModels.OptiHorst_FullModel_New_Year_Simulation_Batch");
simulateModel(startTime=0,stopTime=8860,numberOfIntervals=300,method="Dassl",tolerance=0.000001,resultFile=result);
I am not sure where your problem is and how you change fileName. In your question, timeScale is also not used anywhere. Anyway, here is how I would do it: add a parameter for fileName to your model. Since it is a string, the only way to change it is via a modifier, which can be included in the model name passed to the simulateModel command.
Here is an example: In your model with the time table, propagate the parameter fileName:
model MyModel
  parameter String fileName="NoName" "File where matrix is stored";
  Modelica.Blocks.Tables.CombiTable1Ds combiTable1Ds(
    tableOnFile=true,
    tableName="x",
    fileName=fileName) annotation (Placement(transformation(extent={{-10,-10},{10,10}})));
  Modelica.Blocks.Sources.Ramp ramp(duration=1) annotation (Placement(transformation(extent={{-60,-10},{-40,10}})));
equation
  connect(ramp.y, combiTable1Ds.u) annotation (Line(points={{-39,0},{-12,0}}, color={0,0,127}));
  annotation (uses(Modelica(version="4.0.0")));
end MyModel;
Then change the value of fileName in every loop.
Here we assume that there are three .mat files available in the workspace, named First.mat, Second.mat and Third.mat.
function batchSim
  input String fileNames[:] = {"First", "Second", "Third"};
algorithm
  for f in fileNames loop
    simulateModel("MyModel(fileName=\""+f+".mat\")", stopTime=1, resultFile="Full_Year_Simulation_"+f);
  end for;
  annotation(__Dymola_interactive=true);
end batchSim;
This works quite well, but the downside is that the model will be recompiled in every iteration of the for loop, because the modifier changes. If this is a big problem, define all file paths in a string vector in the model and add an integer parameter for the index. Then use the command simulateExtendedModel and change only the index via the parameters initialNames and initialValues.
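A sketch of that variant, assuming the model is given a parameter Integer fileIndex (the name is illustrative) that selects an entry of the internal path vector, could look like:

translateModel("MyModel");
for i in 1:3 loop
  simulateExtendedModel("MyModel", stopTime=1,
    initialNames={"fileIndex"}, initialValues={i},
    resultFile="Full_Year_Simulation_" + String(i));
end for;

Since the model itself does not change between iterations, it only needs to be translated once.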
Building on the answer by marco (so same model and same external files), an alternative is to make a script such as:
fileNames := {"First", "Second", "Third"};
Advanced.AllowStringParameters := true;
translateModel("MyModel");
for f in fileNames loop
  fileName := f + ".mat";  // matches the files First.mat, Second.mat, Third.mat
  simulateModel("MyModel", stopTime=1, resultFile="Full_Year_Simulation_"+f);
end for;
Unfortunately it seems you cannot turn that into a function.

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediate & end results were previously stored in an SQL database using SQLAlchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes,FileDelta.getSchema())
Here, the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do the same for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the Executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make this work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package methods themselves. Is that maybe the solution, to rewrite the entire package & parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example, but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when writing the output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
import pandas as pd
from pyspark.sql import functions as F

def getSpectrumDict(inputDict):
    # Build the Spectrum object, then convert its pandas fields into plain
    # Python structures that match the Spark schema.
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        if isinstance(value, pd.Series):
            result[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
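For completeness, applying the UDF then looks roughly like the snippet below; the column, stream, and path names are illustrative assumptions rather than names from the original package:

# inputDict is assumed to be the struct column produced by the earlier parsing step
spectra = parsedStream.withColumn("spectrum", spectrumparser(F.col("inputDict")))
(spectra.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/spectra_checkpoint")
    .start("/tmp/spectra_delta"))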

Use Pentaho Variable to Dynamically name EXCEL file

I am trying to dynamically name an Excel file after processing it, for archiving purposes.
If I process Logistics.xlsx I want to save it as U:\Archive\${varDP}.xlsx
Resulting file name U:\Archive\20190709.xlsx
I have tried Get system variable to get the date; this works fine, and I have created the field DateProcessed. However, I am unable to use Set Variables to set varDP from DateProcessed.
Thank you
You cannot set and use a variable in the same transformation. If you want to use a variable, you should have a job with two transformations: the first transformation gets the date and sets the variable; the second transformation can then use it.
The main reason for that is that all steps initialise at the same time. Therefore, when the variable is read by the step that is using it, it's probably not set yet.
For these cases of variable usage and parameter passing, I've been forwarding this previous answer; it has a link to another answer of mine where I go step by step through how to pass parameters to another transformation without 'Set Variables', and in the linked answer I have included a downloadable example.

In kettle what is output field in java script and how to use setVariable in it

In Kettle, what is the output field in the JavaScript step, and how do I use setVariable in it? I tried to set a variable in it, but it gave me an error.
The JavaScript step takes its input from the previous steps; those fields can be accessed as the step's input fields. If you want to pass a field on to the output, you need to use the output fields.
Also, if you want to set a variable in the JavaScript step, you can use
setVariable("variablename","value","type");
They are two different things.
The JavaScript step, when connected in a stream, gets all the fields (columns) as input and can manipulate them with regular JavaScript.
If you want a new field that becomes part of the stream, all you need to do is declare it:
var x;
Then you can write this x as an output field at the bottom of the step, give it a name, and use it. So if you write something like
x = fieldA + fieldB;
you can use x downstream in the stream.
Set Variables, on the other hand, is used for setting a variable in one job or transformation for use in another job; it is more like a global/public variable in programming.
If you want to learn more about it, you can take my course: just click "pentaho kettle tutorial"; there is a lesson (video) on both steps.

SSIS save string variable to text file

It seems like it should be simple, but as of yet I haven't found a way to save the value stored in an SSIS string variable to a text file. I've looked at using the Flat File Destination inside of a Data Flow, but that requires a data flow source.
Any ideas on how to do this?
Use a script task.
I just tried this. I created a File connection manager, with the connection string pointing to the file I wanted to write to. I then created a string variable containing the text to write.
I added a Script Task, specified my string variable in the Read Only Variables list, then clicked Edit Script. The script was as follows:
public void Main()
{
    ConnectionManager cm = Dts.Connections["File.tmp"];
    var path = cm.ConnectionString;
    var textToWrite = (string)Dts.Variables["User::StringVariable"].Value;

    System.IO.File.WriteAllText(path, textToWrite);

    Dts.TaskResult = (int)ScriptResults.Success;
}
This worked with no problems.
Here's a little sample of some code that worked in a Script Task in C#. You'll need to use VB if you're on 2005, I believe. The Script Task also needs its ReadOnlyVariables property set to MyVariable to make the value of your variable available to it.
// create a writer and open the file
TextWriter tw = new StreamWriter("\\\\server\\share$\\myfile.txt");
// write a line of text to the file
tw.WriteLine(Dts.Variables["MyVariable"].Value);
// close the stream
tw.Close();
All it takes is one line of code in a simple Script task. No other dependencies, such as a connection manager, are needed.
Here's what it would look like in C#:
public void Main()
{
    string variableValue = Dts.Variables["TheVariable"].Value.ToString();
    string outputFile = Dts.Variables["Path"].Value.ToString();

    System.IO.File.WriteAllText(outputFile, variableValue);

    Dts.TaskResult = (int)ScriptResults.Success;
}
Obviously the most important line here is the one containing the WriteAllText function call.
The Path variable should contain a full path + filename for the output file.
OK, I have an answer that doesn't involve the use of a Script Task. Pick some OLE DB SQL source you have that's simple and that you have a lot of control over. Make a query that returns only one row, then put this query in a string variable:
"select vara, ' var =: " + @[User::varIWantToSee] + "' as myvar from tablea where vara = 1"
Then in the OLE DB Source, pick "SQL command from variable".
For varIWantToSee, make sure you initialize it with a lot of characters, or SSIS gives that column a very small length that it doesn't let you override. At run time varIWantToSee will get set and you can see it. Pump this all into a Flat File Destination and you are in business. Why do some people have to do this? Because some people need to know the value of the variables in the runtime environment; their laptop development doesn't show the variable values they need. In my case I was running this on an Azure environment that had the database access I needed to test. If I were Microsoft, I would create a task that shows the runtime variable value at that stage of the job by writing it to the SSIS log file created when the package runs. If someone knows how to do that, please enlighten us.
It's possible to use a Derived Column transformation to write the value of a variable into a column. The problem is that it needs a source to drive it, and there's no stock data source you can use that just spits out a null row onto the pipeline.
So, either you repurpose a single-row source to drive the derived column transformation, or you do what another answer suggests, and do it with a Script source.
I did it the way you described. I already had an OLE DB connection manager defined, so I used an OLE DB Source with the SQL Command data access mode and a simple query:
select getdate() as dt
...just to get it out of the way. Now I know the date of my variable pull. Then I used a Derived Column Transform to make my package variables available and wrote it out to a flat file.
Elegant? No, but it gets the job done.
Let's say you don't want to mess with Script Tasks and you don't have a database you can connect to just to issue a data source command like:
SELECT 'Some arbitrary text'
There are still several ways to use an Execute Process Task for something as simple as writing a line of text to a file. For example, you can use PowerShell with an input variable built using the following expression:
"'"+REPLACE(#[User::Text],"'","''")+"' > '"+REPLACE(#[User::Filename],"'","''")+"'"
Notice I escaped the filename because single quotes are legal there. Also note I used '>' for redirecting which overwrites the file if it exists. If I wanted to append I'd use '>>'.
Initially I had trouble with this method when User::Text contained multiple lines. It turns out you need some extra EOL characters after your filename when a command spans lines, like this:
"'" + REPLACE(@[User::Text], "'", "''") + "' > '" + REPLACE(@[User::Filename], "'", "''") + "'\r\n\r\n"
Using cmd.exe with echo is a bit more precarious but can also work in certain circumstances and has much less overhead.
P.S. I've noticed with some versions of PowerShell that StandardInputVariable content is ignored without this:
-Command -
in the Arguments box. A lone minus sign as a Command argument is 'magic' and is documented at https://learn.microsoft.com/en-us/powershell/scripting/powershell.exe-command-line-help. I believe all versions of PowerShell accept this parameter, so even if it's not required for your version you may want to include it, since it shouldn't break anything and may keep your code from breaking if PowerShell is updated to a version that requires it.