I am new to Kettle and I want to run three Kettle transformations (1.ktr, 2.ktr, 3.ktr) one after the other.
Can someone give me an idea of how to achieve this using Kettle steps?
Usually, you organize your Kettle transformations within Kettle jobs (.kjb). In those jobs you can have transformations processed one after the other. You can also include jobs within jobs to further organize your ETL process. If you execute your jobs and transformations from the command line, please be aware that you execute jobs with the tool Kitchen and transformations with Pan. You can create jobs, just like transformations, with Spoon.
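If you go the command-line route, here is a minimal Python sketch that wraps a Kitchen call; it assumes kitchen.sh is on the PATH and that /etl/wrapper.kjb is a hypothetical job that calls the three transformations in sequence:

```python
# A minimal sketch: run a Kettle job with Kitchen from Python.
# Assumptions: kitchen.sh is on the PATH, and /etl/wrapper.kjb is a
# hypothetical job that runs 1.ktr, 2.ktr and 3.ktr one after the other.
import subprocess
import sys

result = subprocess.run(
    ["kitchen.sh", "-file=/etl/wrapper.kjb", "-level=Basic"],
    capture_output=True,
    text=True,
)
print(result.stdout)

# Kitchen signals a failed job with a non-zero exit code.
if result.returncode != 0:
    sys.exit(result.returncode)
```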
The ideal way is to use jobs. Jobs guarantee sequential execution when you put the entries in sequence, unlike a transformation calling multiple transformations through the Transformation Executor step (where they run in parallel).
Create a wrapper job with Start -> (transformation step) KTR1 -> (transformation step) KTR2 -> (transformation step) KTR3 -> Success and run this job.
Create a job and drag three "Transformation" steps into it.
You can add as many transformations as you want to a job. When you run the job, it will execute the transformations one by one.
I have created a few Athena queries to generate reports.
The business wants these reports run nightly and the output of the query emailed to them.
My first step is to schedule the execution of the saved/named Athena queries so that I can collect the output of the query execution from the S3 buckets.
Is there a way to automate the execution of the queries on a periodic basis?
You can schedule events in AWS using Lambda (see this tutorial). From Lambda you can run just about anything, including triggering an Athena query.
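As a sketch of what that Lambda could look like in Python with boto3, triggered by a nightly EventBridge (CloudWatch Events) schedule; the named query ID and S3 output location below are placeholders, not values from the question:

```python
# A minimal sketch of a Lambda handler that runs a saved (named) Athena query.
# NAMED_QUERY_ID and OUTPUT_LOCATION are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

NAMED_QUERY_ID = "11111111-2222-3333-4444-555555555555"   # hypothetical saved query
OUTPUT_LOCATION = "s3://my-report-bucket/athena-results/"  # hypothetical S3 location


def lambda_handler(event, context):
    # Fetch the SQL text and database of the saved query.
    named = athena.get_named_query(NamedQueryId=NAMED_QUERY_ID)["NamedQuery"]

    # Start the query; the result lands in the S3 output location,
    # from where a follow-up step could pick it up and email it (e.g. via SES).
    response = athena.start_query_execution(
        QueryString=named["QueryString"],
        QueryExecutionContext={"Database": named["Database"]},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}
```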
Introduction
To keep it simple, let's imagine a simple transformation.
This transformation gets an input of 4 rows from a Data Grid step.
The stream passes through a Job Executor step, which references a simple job containing a Write To Log entry.
Expectations
I would expect the simple job to execute 4 times, which means 4 log messages.
Results
It turns out that the Job Executor step launches the simple job only once instead of 4 times: I only get one log message.
Hints
The documentation of the Job Executor component specifies the following:
By default the specified job will be executed once for each input row.
This is parametrized in the "Row grouping" tab, with the following field:
The number of rows to send to the job: after every X rows the job will be executed and these X rows will be passed to the job.
Answer
The step actually works well: an input of X rows will execute the "Job Executor" step X times. The fact is that I wasn't able to see it in the logs.
To verify this, I added a simple transformation inside the "Job Executor" step that writes to a text file. After checking this file, it appeared that the "Job Executor" was indeed executed X times.
Research
Trying to understand why I didn't get X log messages after the X executions of the "Job Executor", I added a "Wait for" component inside the initial simple job. Adding a two-second wait finally allowed me to see the X log messages appear during execution.
Hope this helps because it's pretty tricky. Please feel free to provide further details.
A little late to the party, as a side note:
Pentaho is a set of programs (Spoon, Kettle, Chef, Pan, Kitchen). The engine is Kettle, and everything inside a transformation is started in parallel. This makes log retrieval a challenging task for Spoon (the UI). You don't actually need a "Wait for" entry; try outputting the logs to a file (by specifying a log file in the Job Executor entry properties) and you'll see everything in place.
Sometimes we need to give Spoon a little bit of time to get everything in place. Personally, that's why I recommend not relying on Spoon's Execution Results logging tab; it is better to output the logs to a database or to files.
I have an SSIS job that contains 4 packages that perform an ETL job.
In the first package we use "select newid()" to create a unique ID for that ETL run.
I am wondering how I can pass that value to all of the ETL packages so that all 4 packages can use the same ID.
The Execute Package Task is out of the picture because we want the job to have 4 steps (performed by 4 packages).
Can anyone point me in a direction?
Thanks
I would create a "Master Control Package" to generate the ID, store it in a variable, then use Execute Package Tasks to call the sub-packages.
This makes your SQL Agent job definition simpler (1 step calling the MCP), and you can leverage out-of-the-box features of SSIS like parallel execution, conditional execution and restart control via checkpoints.
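SSIS itself is configured in the designer rather than in code, but the underlying pattern is easy to see in a short Python analogy (the child script names below are made up): generate the run ID once in the master and hand the same value to each child step, instead of letting each step call NEWID() on its own.

```python
# Analogy only, not SSIS: the master generates the run ID once and passes the
# same value to every child step. Child script names are hypothetical.
import subprocess
import uuid

run_id = str(uuid.uuid4())  # the one-time equivalent of SELECT NEWID()

for child in ["extract.py", "transform.py", "load.py", "report.py"]:
    # Every child receives the identical ID, e.g. as a command-line argument.
    subprocess.run(["python", child, "--run-id", run_id], check=True)
```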
I am a newbie to Pig.
I have written a small script in Pig in which I first load data from two different tables and then right outer join them; later I also have another join for two different sets of data. It works fine, but I want to see the steps of execution: for example, in which step my data is loaded, so I can note the time needed for loading, and later the details of the join step, like how much time it takes to join that many records.
Basically, I want to know which part of my Pig script is taking the longest to run so that I can further optimize it.
Is there any way we could print from within the script to find out which steps have been executed and which have started executing?
Through the JobTracker details link I could not get much info; I could just see that a mapper or a reducer is running, but ideally I would like to know which part of the script the mapper is running for, and I could not find that.
For example, for a Hive job we can see in the JobTracker details link which step is currently being executed.
Any information will be really helpful.
Thanks in advance.
I'd suggest you have a look at the following:
Pig's Progress Notification Listener
Penny: this is a monitoring tool, but I'm afraid it hasn't been updated recently (e.g. it won't compile for Pig 0.12.0 unless you make some code changes)
Twitter's Ambrose project. https://github.com/twitter/ambrose
On the other hand, after executing the script you can see detailed statistics about the execution time of each alias (see: Job Stats (time in seconds)).
Have a look at the EXPLAIN operator. This doesn't give you real-time stats as your code is executing, but it should give you enough information about the MapReduce plan your script generates that you'll be able to match up the MR jobs with the steps in your script.
Also, while your script is running you can inspect the configuration of the Hadoop job. Look at the variables "pig.alias" and "pig.job.feature". These tell you, respectively, which of your aliases (tables/relations) is involved in that job and what Pig operations are being used (e.g., HASH_JOIN for a JOIN step, SAMPLER or ORDER BY for an ORDER BY step, and so on). This information is also available in the job stats that are output to the console upon completion.
I have a Pentaho transformation which consists of, for example, 10 steps. I want to start this job for N input parameters, but not in parallel; each execution should start only after the previous transformation has fully completed (the process runs in a transaction and is committed or rolled back). Is this possible with Pentaho?
You can add the 'Block this step until steps finish' step from the Flow category to your transformation. Or you can combine the 'Wait for SQL' component from Utility with a loop in your job.
Regards
Mateusz
Maybe you should do it using jobs instead of transformations. Jobs only run in sequence, while transformations run in parallel. (Strictly speaking, a transformation has an initialization phase that runs in parallel, and then the flow runs sequentially.)
If you can't use jobs, you can always do what Mateusz said.