How do I save the output path of Hadoop reducers to a variable?
This variable will be used by all other MR jobs.
These jobs will be sequential.
All of these sequential MR jobs will write their corresponding output to that output directory, and I need the path variable to be updated accordingly.
Take a look at "Oozie". It's a Hadoop workflow engine that does just what you described: each job can take the output of the previous job as its input.
There are other solutions for this as well, such as the "Cascading" API.
http://www.concurrentinc.com/products/
http://yahoo.github.com/oozie/releases/2.0.0/#Quick_Start
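If you would rather manage the chaining yourself instead of using a workflow engine, a common pattern is a single Java driver that runs the jobs sequentially and passes the same output Path variable from one job to the next. Below is a minimal sketch of that idea; the class names, argument layout, and subdirectory names are assumptions for illustration, not taken from your code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SequentialJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One variable holds the shared output directory; each job writes under it.
        Path input = new Path(args[0]);          // original input
        Path outputDir = new Path(args[1]);      // shared output directory
        Path firstOutput = new Path(outputDir, "job1");
        Path secondOutput = new Path(outputDir, "job2");

        // First job: reads the original input and writes to firstOutput.
        Job job1 = Job.getInstance(conf, "job-1");
        job1.setJarByClass(SequentialJobsDriver.class);
        // job1.setMapperClass(FirstMapper.class);   // hypothetical mapper
        // job1.setReducerClass(FirstReducer.class); // hypothetical reducer
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, firstOutput);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Second job: its input is the previous job's output path variable.
        Job job2 = Job.getInstance(conf, "job-2");
        job2.setJarByClass(SequentialJobsDriver.class);
        // job2.setMapperClass(SecondMapper.class);   // hypothetical mapper
        // job2.setReducerClass(SecondReducer.class); // hypothetical reducer
        FileInputFormat.addInputPath(job2, firstOutput);
        FileOutputFormat.setOutputPath(job2, secondOutput);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

The same idea scales to more jobs: keep reassigning the "previous output" Path variable after each job completes, since waitForCompletion(true) blocks until the job finishes.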
I have a Pentaho job that uses a shell script step to process some data.
But I found that if I want to use the script's result, I have to write it to a file and then read the file to assign variables.
Is there an easier way to use the result of a script step in the following steps?
This is the Script content.
Here is the whole process.
In Pentaho you cannot create a variable and use it in the same transformation.
Basically you just need to create one ktr and one job:
the first ktr is in charge of performing some task and saving the result as a variable with the Set Variables step (using the root-job scope option)
variables created in the first ktr are then also available at job level
If you want to use the variable in another ktr:
the second ktr should start with a Get Variables step to retrieve the variable created in the previous transformation
the transformations should be executed sequentially using a job
In your case, you should run the shell script in the first ktr, transform its result into a variable and save it with Set Variables. The job that invokes the ktrs is then able to use the variable created in the previous ktr.
Pig uses variables to store the data.
When I load data from HDFS into a variable in Pig, where is the data temporarily stored?
What exactly happens in the background when we load the data into the variable?
Kindly help.
Pig lazily evaluates most expressions. In most cases it only checks for syntax errors and the like. For example,
a = load 'hdfs://I/Dont/Exist'
won't throw an error unless you use STORE or DUMP or something along those lines that results in the evaluation of a.
Similarly, if a file exists and you load it into a relation and perform transformations on it, the file is usually spooled to the /tmp folder and then the transformations are performed. If you look at the messages that appear when you run commands in grunt, you'll notice file paths starting with file:///tmp/xxxxxx_201706171047235. These are the files that store the intermediate data.
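If you want to see the same lazy behaviour outside of grunt, here is a minimal sketch using Pig's embedded PigServer API; the nonexistent input path and the output directory name are made up for illustration, and the point is simply that the failure surfaces when execution is forced, not when the LOAD is registered.

import org.apache.pig.PigServer;

public class LazyLoadDemo {
    public static void main(String[] args) throws Exception {
        // Local execution mode; use "mapreduce" to run against a cluster instead.
        PigServer pig = new PigServer("local");

        // No error here: Pig only parses the statement and builds a logical plan.
        pig.registerQuery("a = LOAD '/path/that/does/not/exist' AS (line:chararray);");

        // This call (the embedded equivalent of STORE in grunt) triggers evaluation,
        // so this is where the missing input file is reported.
        pig.store("a", "lazy_demo_output");
    }
}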
When you log a job in Pentaho Data Integration, one of the fields is ID_JOB, described as "the batch id - a unique number increased by one for each run of a job."
Can I get this ID? I can see it in my logging tables, but I want to set up a transformation to get it. I think there might be a runtime variable that holds an ID for the running job.
I've tried using the Get Variables and Get System Info transformation steps to no avail. I am a new Kettle user.
The batch IDs of the current transformation and of the parent job are available in the Get System Info step. On PDI 5.0 they come before the "command line arguments" entries, but the order changes with each version, so you may have to look it up.
You need to create the variable yourself to house the parent job batch ID. The way to do this is to add another transformation as the first step in your job that sets the variable and makes it available to all the other subsequent transformations and job steps that you'll call from the job. Steps:
1) As you have probably already done, enable logging on the job
JOB SETTINGS -> SETTINGS -> CHECK: PASS BATCH ID
JOB SETTINGS -> LOG -> ENABLE LOGGING, DEFINE DATABASE LOG TABLE, ENABLE: ID_JOB FIELD
2) Add a new transformation, call it "Set Variables", as the first step after the start of your job
3) Create a variable that will be accessible to all your other transformations and that contains the value of the current job's batch ID
3a) ADD A GET SYSTEM INFO STEP. GIVE A NAME TO YOUR FIELD - "parentJobBatchID" AND TYPE OF "parent job batch ID"
3b) ADD A SET VARIABLES STEP AFTER THE GET SYSTEM INFO STEP. DRAW A HOP FROM THE GET SYSTEM INFO STEP TO THE SET VARIABLES STEP AS ITS MAIN OUTPUT
3c) IN THE SET VARIABLES STEP SET FIELDNAME: "parentJobBatchID", SET A VARIABLE NAME - "myJobBatchID", VARIABLE SCOPE TYPE "Valid in the Java Virtual Machine", LEAVE DEFAULT VALUE EMPTY
And that's it. After that, you can go back to your job and add subsequent transformations and steps and they will all be able to access the variable you defined by substituting ${myJobBatchID} or whatever you chose to name it.
IT IS IMPORTANT THAT THE SET VARIABLES STEP IS THE ONLY THING THAT HAPPENS IN THE "Set Variables" TRANSFORMATION AND ANYTHING ELSE YOU WANT TO ACCESS THAT VARIABLE IS ADDED ONLY TO OTHER TRANSFORMATIONS CALLED BY THE JOB. This is because transformations in Pentaho are multi-threaded and you cannot guarantee that the Set Variables step will happen before other activities in that transformation. The parent job, however, executes sequentially, so you can be assured that once you establish the variable containing the parent job batch ID in the first transformation of the job, all other transformations and job steps will be able to use that variable.
You can test that it worked before you add other functionality by adding a "Write To Log" step after the Set Variables transformation that writes the variable ${myJobBatchID} to the log for you to view and confirm it is working.
I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) picks a file based on the output of a SQL query. If the query output is abc.xlsx, the Microsoft Excel Input step should pick up abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.
Transformations in Kettle run asynchronously, so you're probably going to need a job for this.
Files to create
Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
Create a transformation that pulls data from the Excel file, using the variable populated as the filename
Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature. This is the reason for the second transformation; the job won't step into the second transformation until the first one is done running (therefore, not until the variable is populated).
This all assumes you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a set, then the setup is a little different.
The Excel input step has an "accept filenames from previous step" option. You can have a Table Input step build the full path of the file you want to read (or build it later from the base dir and the short filename), pass the filename to the Excel input, tick that box, and specify the step and the field you want to use for the filename.
Currently, when I STORE into HDFS, it creates many part files.
Is there any way to store out to a single CSV file?
You can do this in a few ways:
To set the number of reducers for all Pig operations, you can use the default_parallel property - but this means every single step will use a single reducer, decreasing throughput:
set default_parallel 1;
Prior to calling STORE, if the operation that produces the stored relation is one of COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), or ORDER BY, then you can use the PARALLEL 1 keyword to make that command use a single reducer:
GROUP a BY grp PARALLEL 1;
See Pig Cookbook - Parallel Features for more information
You can also use Hadoop's getmerge command to merge all those part-* files.
This is only possible if you run your Pig scripts from the Pig shell (and not from Java).
This has an advantage over the proposed solution: you can still use several reducers to process your data, so your job may run faster, especially if each reducer outputs little data.
grunt> fs -getmerge <Pig output file> <local file>
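If you are driving Pig from Java rather than from the shell (the case mentioned above where getmerge is not available), a rough programmatic equivalent is Hadoop's FileUtil.copyMerge helper. The sketch below assumes a Hadoop 1.x/2.x client, where this method exists, and uses placeholder paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergePigOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Merge all part-* files from the Pig output directory into one local file.
        // Arguments: source FS, source dir, destination FS, destination file,
        // delete the source after merging, configuration, separator appended between files.
        FileUtil.copyMerge(
                hdfs, new Path("/user/me/pig_output"),   // placeholder HDFS output dir
                local, new Path("/tmp/result.csv"),      // placeholder local target file
                false, conf, "");
    }
}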