Use of Enable blocking in PDI - Pig Script Executor - apache-pig

I am exploring the Big Data plugin in Pentaho 5.2 and trying to run the Pig Script Executor. I do not understand the purpose of the
Enable Blocking option. The PDI documentation says:
If checked, the Pig Script Executor job entry will prevent downstream
entries from executing until the script has finished processing.
I am aware that running a Pig script converts the execution into MapReduce jobs. My job is simply Start -> Pig Script. If I uncheck Enable Blocking, the script does not execute and I get permission denied errors. As per the documentation " ".
What does "downstream" mean here? I have no hops leading out of the Pig Script entry, so I do not understand what Enable Blocking does. Any hints would be appreciated.

Enable blocking checked: PDI deploys the task to the Hadoop cluster, tracks its progress, and proceeds with the rest of the job entries only AFTER the Hadoop job has finished.
Enable blocking unchecked: PDI deploys the task to the Hadoop cluster and forgets about it. The rest of the job entries proceed as soon as the cluster accepts the task, without waiting for it to complete.
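The difference between the two modes can be illustrated with a small Java sketch. This is not PDI code: the class BlockingDemo, the method runJob and the 200 ms sleep that stands in for the Pig script are all assumptions made up for illustration. It only shows the ordering of events that the answer describes.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class BlockingDemo {

    /** Simulates a job with one long-running script entry followed by one downstream entry. */
    static List<String> runJob(boolean enableBlocking) {
        List<String> events = new CopyOnWriteArrayList<>();

        events.add("script submitted");
        // Hypothetical stand-in for the Pig script running on the cluster.
        CompletableFuture<Void> script = CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            events.add("script finished");
        });

        if (enableBlocking) {
            // Blocking: wait for the script before letting the job continue.
            script.join();
        }
        events.add("next entry started");

        // Wait here only so that the demo can return the complete sequence.
        script.join();
        return events;
    }

    public static void main(String[] args) {
        System.out.println("blocking:     " + runJob(true));
        System.out.println("non-blocking: " + runJob(false));
    }
}
```

With blocking enabled the downstream entry starts after the script has finished; with blocking disabled it starts while the script is still running.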

Related

Tivoli Workload Scheduler WAPL to restart failed job from step

Is it possible to restart a failed job in TWS on z/OS using WAPL, either from a particular step or as the entire job?
I am trying to automate the restart from Jenkins using WAPL and was unable to find the right syntax.
Thanks

Issue with executing spark sql job using oozie action

I am facing a strange issue when running a spark-sql (Spark 2) job through an Oozie action: sometimes it executes fine, but sometimes it stays in the "Running" state forever. The logs show the following warning:
WARN org.apache.spark.scheduler.cluster.YarnClusterScheduler - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The strange thing is that we have already provided sufficient resources. This is visible both in the Spark environment variables and in the cluster resources (the cluster has sufficient cores and RAM).
<spark-opts>--executor-memory 10G --num-executors 7 --executor-cores 3 --driver-memory 8G --driver-cores 2</spark-opts>
With the same configuration sometimes it is executing fine as well. Are we missing something?
The issue was caused by a jar conflict. The following steps help to identify one:
a) Check the Maven dependency tree to make sure there is no transitive dependency conflict.
b) While the Spark job is running, check the environment variables it is using in the Spark UI.
c) Resolve the conflict and run a Maven clean package.
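As a complement to the steps above, one generic way to see which jar a class was actually loaded from is to ask the JVM for the class's code source. The sketch below is not part of the original answer; JarLocator and locate are hypothetical names, and the check only reports the class path of the JVM it runs in, so it would have to be executed inside the Spark driver or executor to be meaningful there.

```java
import java.security.CodeSource;

public class JarLocator {

    /** Returns the location (usually a jar URL) a class was loaded from. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // JDK classes loaded by the bootstrap loader have no code source.
            return "(bootstrap class path or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass the fully qualified name of the class suspected of being in conflict.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(className + " loaded from " + locate(Class.forName(className)));
    }
}
```

If the printed location is not the jar version you expected, a conflicting dependency is probably earlier on the class path.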

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm a new Flink user and I have the following problem.
I use Flink on a YARN cluster to transfer related data extracted from an RDBMS to HBase.
I wrote a Flink batch application in Java with multiple ExecutionEnvironments (one per RDB table, so that the rows of each table are transferred in parallel). The tables are transferred sequentially, one after another, because the call to env.execute() is blocking.
I start the YARN session like this:
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on the started YARN session via the shell script transfer.sh. Its content is:
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script manually from the command line it works fine: jobs are submitted to the YARN session one by one without errors.
Now I need to run this script from another Java program.
For this purpose I use
Runtime.getRuntime().exec("transfer.sh");
(Maybe there are better ways to do this? I have looked at the REST API, but there are some difficulties because the job manager is proxied by YARN.)
At the beginning it works as usual: the first several jobs are submitted to the session and finish successfully. But the following jobs are not submitted to the YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see this error (no other errors appear at DEBUG level):
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I tried to analyse this problem myself and found that the error occurs in the JobClient class while sending a ping request with a timeout to the JobClientActor (i.e. the YARN cluster).
I tried increasing several heartbeat and timeout options such as akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay, but that doesn't solve the problem: new jobs are still not submitted to the YARN session from CliFrontend.
The environment in both cases (manual call and call from another program) is the same. When I call
$ ps axu | grep transfer
it gives me the output
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried updating Flink to the 1.4.0 release and changing the parallelism of the job (even to -p 1), but the error still occurs.
I have no idea what could be different. Is there any workaround?
Thank you for any help.
I finally found out how to resolve this error.
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I really don't know why the call to inheritIO helps in this case because, as I understand it, it just redirects the IO streams of the child process to the parent process.
But I have checked that if I comment out this line of code, the program starts to fail again.
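A plausible explanation, offered here as an assumption rather than a confirmed diagnosis: a process started with Runtime.exec writes its stdout and stderr into pipes, and the java.lang.Process documentation warns that a child may block or even deadlock if the parent does not read those streams promptly. With inheritIO() the child writes directly to the parent's streams, so no pipe can fill up. A minimal sketch of the fix follows; ScriptLauncher, inheriting and runAndWait are hypothetical names, and the "./transfer.sh" path is assumed.

```java
import java.io.IOException;

public class ScriptLauncher {

    /** Builds a launcher whose child shares the parent's stdin, stdout and stderr. */
    static ProcessBuilder inheriting(String... command) {
        return new ProcessBuilder(command).inheritIO();
    }

    /** Starts the command, waits for it to exit and returns its exit code. */
    static int runAndWait(String... command) throws IOException, InterruptedException {
        Process process = inheriting(command).start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // "./transfer.sh" is the script from the question; the relative path is an assumption.
        int exitCode = runAndWait("./transfer.sh");
        System.out.println("transfer.sh exited with code " + exitCode);
    }
}
```

If the parent needs the child's output instead of just passing it through, reading the streams in a separate thread or redirecting them to a file with redirectOutput would avoid the same blocking risk.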

Shouldn't apache specific cron-jobs run in docker image?

The Best practices for running Docker guide states that only one process should run per Docker container. On Ubuntu there are some cron jobs related to apache-httpd that run daily (located in /etc/cron.daily/apache2).
When using the Apache Docker image from the official repository (look here), those cron jobs are not run: only the httpd process is started, and cron is not running.
Shouldn't the cron jobs mentioned above be executed?
I have a hard time figuring out how one can execute these cron jobs from another Docker image, as suggested in the "Best practices" guide, since the "cron Docker image" would need access to the Apache process in order to run the cron jobs correctly.
For basic apache there are no cron jobs to run.
If you have cron jobs to run there is no "right answer".
If they run daily and only run for a certain amount of time, you could certainly just schedule those to run with an external scheduler instead of using cron.
If they run more frequently, or you don't have a scheduler that can handle that (like AWS Lambda), then it's not against best practices to have your web server container run them with cron; you would just have to build your own image on top of Apache's to handle it.
If your real question is "How do I run cron jobs", a quick Google search turned up:
https://github.com/aptible/docker-cron-example
https://hub.docker.com/r/hamiltont/docker-cron/
https://getcarina.com/docs/tutorials/schedule-tasks-cron/
You would just modify those to run in the background with & or nohup
What have you tried?

Bamboo Jobs kick off from Rundeck and Execute Bamboo jobs from command Prompt

I have the following requirements:
Execute a Bamboo job from Rundeck. (I found plugins to execute a Rundeck job from Bamboo, but I need the reverse.)
Call the jobs created in Bamboo from the command prompt. (I am thinking of executing the jobs from the command prompt in Rundeck.)
Please suggest any alternatives for the above tasks. The ultimate goal is to kick off the Bamboo jobs from Rundeck.
I would suggest using the REST API provided by Atlassian. Documentation can be found here and, more specific to your use case, here.
After you've got the correct API call(s) to trigger your Bamboo job, just add that as a curl step at the bottom of your Rundeck job and it should do what you need.
FWIW, I've done this for Jenkins & Rundeck but never for Bamboo; the solution should be the same since they're very similar products.
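For illustration, the same kind of call can be sketched in Java (11 or later) instead of curl. Everything specific here is an assumption to verify against the Atlassian documentation for your Bamboo version: the queue endpoint path /rest/api/latest/queue/{planKey}, the use of basic authentication, and the placeholder host, plan key and credentials. BambooTrigger and queueBuildRequest are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BambooTrigger {

    /** Builds a POST request that asks Bamboo to queue a build of the given plan. */
    static HttpRequest queueBuildRequest(String baseUrl, String planKey, String user, String password) {
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder()
                // Assumed endpoint; check it against the REST API docs of your Bamboo version.
                .uri(URI.create(baseUrl + "/rest/api/latest/queue/" + planKey))
                .header("Authorization", "Basic " + credentials)
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values; replace them with your own server, plan key and credentials.
        HttpRequest request =
                queueBuildRequest("https://bamboo.example.com", "PROJ-PLAN", "user", "secret");
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Packaged as a jar, this could be invoked from a Rundeck command step in the same place where the curl call would otherwise go.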