Pentaho Data Integration: The job keeps running even though it has succeeded

I have a simple job that moves data from source to destination with some transformations. Most of the time the job succeeds without any issues. But lately when I run the job, it gets stuck at the last entry with the hourglass symbol, stating that the job is still in progress, whereas it has actually completed and the data is present in the destination. Then I have to stop the job myself, and when I do so, the last job entry shows the green tick mark.
I want the job to successfully run without any intervention.

I couldn't see a "Success" step in your screenshot. If you have missed it, please add a Success step at the end of the job; otherwise it won't be stopped.

Related

pentaho job stops in the middle of a transformation without any indication in log file

I'm new to using Pentaho and I need your help to investigate a problem.
I have scheduled a job in crontab to run via the kitchen command. I'm using Pentaho release 6.0.1.0.386.
Sometimes (it's not a deterministic problem) one of the transformations stops after "Loading transformation from repository" and before "Dispatching started for transformation". The log just stops. No errors. Nothing. And the job doesn't go on.
Any idea? Any check I can do? Thanks
Is the quantity of data in this transformation much bigger?
There are some files that can cause errors; you can find them in this path:
My Computer / Users / <your user> / .kettle
[screenshot of the .kettle folder with the files to delete marked]
If you delete the files marked in the image, they will be created again automatically the next time you open Pentaho.
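As an additional check, it can help to run kitchen with a more verbose log level and capture its output, so you can see exactly where the job hangs. Below is a minimal sketch of a wrapper script you could call from crontab; the paths and job file are placeholders, kitchen's -file and -level options are standard, and everything else is an assumption about your setup.

```python
# Hypothetical wrapper: run a Kettle job with Debug-level logging and keep the
# full output in a timestamped log file so a hang point is visible afterwards.
import datetime
import subprocess

JOB_FILE = "/opt/etl/jobs/nightly_load.kjb"              # placeholder job path
KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"     # placeholder install path
LOG_FILE = "/var/log/etl/nightly_load_{:%Y%m%d_%H%M}.log".format(
    datetime.datetime.now())

with open(LOG_FILE, "w") as log:
    result = subprocess.run(
        [KITCHEN, "-file", JOB_FILE, "-level", "Debug"],  # Debug is more verbose than the default Basic
        stdout=log,
        stderr=subprocess.STDOUT,
    )

print("kitchen exit code:", result.returncode)
```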

Not able to test a batch job result using Selenium - the job takes 20-25 minutes to complete

I have to do the following in my test.
Login to the site
Click on a button to upload an Excel file; this kicks off a task
Wait for the job to complete (takes 20-25 minutes)
Verify the output of the job (Successful, Failed)
The Excel file contains data based on which the job will either return success or failure.
I am trying to do this using Selenium, but the problem is that our framework can execute a test for at most 10 minutes; otherwise it kills the test.
So I am not sure how to go about automating this scenario.
One of the ways I could see is:
First test --> just upload the file and kick off the task
Second test (depends on the first test) --> verify the results
Have you faced a similar situation?
Could any of you please suggest an approach for this, or an alternative tool that I could use to automate this scenario?
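A minimal sketch of the two-test split described above, assuming the job status eventually appears on a results page once it finishes; the base URL, element locators, and file path are placeholders, and the second test would be scheduled to run only after the job has had time to complete.

```python
# Sketch of splitting the scenario into two short tests so neither exceeds
# the framework's 10-minute limit. All locators and URLs are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

BASE_URL = "https://example.com"          # placeholder

def test_upload_and_kick_off_job():
    driver = webdriver.Chrome()
    try:
        driver.get(BASE_URL + "/login")
        driver.find_element(By.ID, "username").send_keys("tester")      # placeholder locators
        driver.find_element(By.ID, "password").send_keys("secret")
        driver.find_element(By.ID, "login-button").click()

        driver.get(BASE_URL + "/jobs")
        driver.find_element(By.ID, "file-input").send_keys("/tmp/input.xlsx")
        driver.find_element(By.ID, "upload-button").click()             # kicks off the 20-25 minute task
    finally:
        driver.quit()

def test_verify_job_result():
    # Run this test 25-30 minutes after the first one (separate scheduled run).
    driver = webdriver.Chrome()
    try:
        driver.get(BASE_URL + "/jobs")
        status = WebDriverWait(driver, 60).until(
            EC.visibility_of_element_located((By.ID, "job-status"))
        ).text
        assert status == "Successful"   # or "Failed", depending on the Excel input
    finally:
        driver.quit()
```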

SSIS execution reports show a delay in report generation and in moving to the next step

Suppose a job has 2 packages: Package1 has 10 steps and Package2 has 2. After the job is triggered, assume the SSIS reports show that step 4 is in progress, while in reality step 8 has also completed.
Why is there this delay in the SSIS execution reports?
Now suppose all 10 steps have actually completed, and only some hours after the actual completion do the reports show all 10 steps as successful. Ideally Package2 should start executing right after that, but this does not happen for the next 5 to 10 hours.
Any suggestion on this would be helpful.
1.
The SSIS GUI is not great at displaying live updates on how your package is progressing when packages are large. Sometimes SSIS shows red when a step succeeded and green when it did not, and sometimes it shows steps that have not started yet as green while the current step is still white. It's something you will just have to put up with until Microsoft fixes it.
The best thing you can do (I have found) to see exactly which step your package is at is to set up a table in your database and update it with the step name and start/finish times. That way you can monitor the table to see when your steps actually complete (see the sketch after this answer).
2.
This depends entirely on how you are triggering the second package to execute. I'd recommend making the last step of the first package execute the next package, as long as there are no other dependencies.
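As a hedged illustration of the logging-table idea from point 1: in an actual package the inserts and updates would typically be done with Execute SQL Tasks at the start and end of each step, but the same pattern is sketched here in Python so the shape of the table and the updates is clear. The table name, columns, and connection string are assumptions.

```python
# Sketch of a step-logging table pattern: record when each step starts and
# finishes so you can monitor real progress instead of the SSIS reports.
from datetime import datetime

import pyodbc

conn = pyodbc.connect("DSN=etl_db")       # placeholder connection string
cur = conn.cursor()

def log_step_start(step_name):
    cur.execute(
        "INSERT INTO etl_step_log (step_name, started_at) VALUES (?, ?)",
        step_name, datetime.now())
    conn.commit()

def log_step_finish(step_name):
    cur.execute(
        "UPDATE etl_step_log SET finished_at = ? "
        "WHERE step_name = ? AND finished_at IS NULL",
        datetime.now(), step_name)
    conn.commit()

log_step_start("Package1 - Step 4")
# ... the real work of the step runs here ...
log_step_finish("Package1 - Step 4")
```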

Dummy step does not work in Job

Each transformation creates a csv file in a folder, and I want to upload all of the files once the transformations are done. I added a Dummy step, but the process didn't work as I expected: the Hadoop Copy Files step is executed after each transformation. Why? And how should I design the flow? Thanks.
First of all, if possible, try launching the .ktr files in parallel (right-click on the START step > click on "Launch next entries in parallel"). This will ensure that all the ktrs are launched in parallel.
Secondly, you can choose any of the below steps, depending on what is feasible for you (instead of the Dummy step):
"Checks if files exist" step: before moving to the Hadoop step, you can do a small check that all the files have been properly created and then proceed with the execution.
"Wait for" step: you can wait some time for all the steps to complete before moving to the next entry. I don't suggest this, since the time to write a csv file may vary, unless you are totally sure of the timing.
"Evaluate files metrics" step: check the count of the files before moving forward. In your case, check whether the file count is 9 or not.
The idea is simply to do some sort of check on the files before you copy the data to HDFS.
Hope it helps :)
You cannot join the transformations the way you do.
Each transformation, upon success, will follow on to the Dummy step, so the Dummy step will be reached for EVERY transformation.
If you want to wait until the last transformation finishes so that the Hadoop Copy Files step runs only once, you need to do one of two things:
Run the transformations in a sequence, where each ktr is called upon success of the previous one (slower), or
As suggested in another answer, launch the KTRs in parallel, but with one caveat: they need to be called from a sub-job. Here's the idea:
Your main job has a Start entry, calls the sub-job and, upon success, calls the Hadoop Copy Files step.
Your sub-job has a Start entry from which all transformations are called in separate flows. You use "Launch next entries in parallel" so they are all launched at once.
The sub-job keeps running until the last transformation finishes, and only then is the flow passed on to the Hadoop Copy Files step, which will therefore be launched only once.

How to save a smart folder's jobs marked OK in BMC Control-M

Suppose I have a smart folder X with 5 jobs and multiple dependencies. For example, let us assume the job hierarchy looks like this:
So, from the Planning tab, I order this smart folder for execution. Since I don't want to wait for Job 202 to execute, as it is a tape backup job that is not needed in the environment I am working in, I mark Job 202 as "OK" in the Monitoring tab. For Job 302, it is a prerequisite that Job 202 ends "OK".
In a similar setup, I have hundreds of jobs with similar dependencies. I have to order the folder from time to time, and I have to manually mark all the jobs that are not required to run as "OK". I cannot simply remove the jobs that I need to mark OK, because they have dependencies with the other jobs I want to execute.
My question is: how can I do this once - that is, mark all unnecessary jobs as OK - and save it for all future instances when I run this workload?
If the job you mentioned as Job 202 is not really needed for Job 302 to start, then Job 302 should be independent of it. Remove it from the flow and make it independent. Make these changes in Control-M Desktop and write them to the database. You will not have to make the changes daily.
For all jobs not required in the "environment" you are testing, you can tick the "Run as dummy" checkbox to convert those jobs to dummy jobs while maintaining the structure, relationships, and dependencies in your folder. A dummy job will not execute the command, script, etc.; rather, it only gives Control-M the instructions for the job's post-processing steps, or in your case the adding of conditions to continue processing the job flow after the dummy job.
(I realize this is an old question; I provided a response in case it is helpful to anyone who finds this thread after me.)