Flink job started from another program on YARN fails with "JobClientActor seems to have died" - hadoop-yarn

I'm new flink user and I have the following problem.
I use flink on YARN cluster to transfer related data extracted from RDBMS to HBase.
I write flink batch application on java with multiple ExecutionEnvironments (one per RDB table to transfer table rows in parrallel) to transfer table by table sequentially (because call of env.execute() is blocking).
I start YARN session like this
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on YARN session started via shell script transfer.sh. Its content is here
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script from command line manually it works fine - jobs are submitted to YARN session one by one without errors.
Now I should be able to run this script from another java program.
For this aim I use
Runtime.exec("transfer.sh");
(maybe are there better ways to do this? I have seen at REST API but there are some difficulties because job manager is proxied by YARN).
At the beginning is works as usually - first several jobs are submitted to session and finished successfully. But the following jobs are not submitted to YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see error (and no another errors found in DEBUG level)
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I have tried to analyse this problem by myself and found out that this error has occurred in JobClient class while sending ping request with timeout to JobClientActor (i.e. YARN cluster).
I tried to increase multiple heartbeat and timeout options like akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay options but it doesn't solve the problem - new jobs are not submit to YARN session from CliFrontend.
The environment for both case (manual call and call from another program) is the same. When I call
$ ps axu | grep transfer
it will give me output
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried to update flink to 1.4.0 release or change parallelism of job (even to -p 1) but error has still occurred.
I have no idea what could be different? Is any workaround by the way?
Thank you for any help.

Finally I find out how to resolve that error
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I really don't know why the call of inheritIO helps in that case because as I understand it just redirects IO streams from child process to parent process.
But I have checked that if I comment out this line of code the program begins to fall again.

Related

Batch job submission failed: I/O error writing script/environment to file

I installed slurm on a workstation and it seemed to work, i can use the slurm commands, srun is working too.
But when i try to launch a job from a script using sbatch test.sh i get the following error : Batch job submission failed: I/O error writing script/environment to file even if the script is the simplest like
#!/bin/bash
srun hostname
Make sure slurmd is running as root. See the SlurmdUser parameter in slurm.conf. Its default value is root and it should be so.
Note this is different from the SlurmUser parameter, that defines the user which runs the controller processes ; this one is preferably not root.
If the configuration is correct, then you might have a faulty filesystem at the location referred to in the SlurmdSpoolDir parameter, where slurmd writes the submission script and environment for jobs assigned to the node.

How to have dbt run with log error FileNotFound to trigger an exit code that isn't 0?

We're running dbt version 0.16.1. We've set up our data pipeline to run in Airflow, and have a library set up to map each dbt model run within it's own bash operator on Airflow.
The dbt run command executed is as follows:
cd /usr/local/airflow/models/[PACKAGE_NAME] && dbt --log-format json run --models [MODEL_NAME]--no-version-check --profiles-dir=/usr/local/airflow/dags/dags-enterprise-model/enterprise_model/include --target=[TARGET] --profile=[PROFILE]
Occasionally (likely when two models are being run at the same time), Airflow will show the following message from within the dbt run command:
INFO - FileNotFoundError: [Errno 2] No such file or directory: 'logs/dbt.log' -> 'logs/dbt.log.1'
This is problematic because the logfiles do not get updated, but the exit code of the task is listed a 0:
Command exited with return code 0
This causes Airflow to mark the task as a success; however, the log wasn't printed successfully.
My questions:
Is there a a way for these errors to be raised as an actual error?
Failing that, is there a way to specific a unique log file?
I'm not sure if this is a gap in my understand, a bug within dbt's logging, or maybe both?
It definitely sounds like this is the result of invoking dbt multiple times simultaneously, while having it write to the same files. It's not a dbt bug because we don't intend for dbt to be invoked simultaneously; a single invocation can handle concurrent model runs via threads. Log collisions are one risk of reimplementing dbt's model DAG as Airflow DAGs.
Those are both fair questions:
Historically, dbt only used two log levels: debug and info. See the comment on a related issue: dbt#2680. I totally appreciate that Airflow and other orchestration tools have well defined notification behaviors when presented with different log levels. A community member actually just opened a PR to add error-level logging (dbt#2723).
It is possible to set a custom log path for a dbt invocation using the log-path config in dbt_project.yml (docs)

SSH command step not working for one command - in Jmeter

I have a unique problem using jmeter SSH command.
I use this step to run spark jobs.
the problem is that one of the commands not working, to clarify it connects and not get response and just wait and wait for hours, and nothing displayed on screen.
I know how to work with the tool, and this behavior is special for this script alone.
All other script worked, I duplicate one that worked for example
sudo /run_stg.sh this command worked
sudo /run_off2-stg.sh this command not worked
if I run the job manually via jenkins it worked
if I entered to command line and use plik ssh it worked,
the problem is just Jmeter, that is waiting and waiting and I can not understand for what?
the job is about 3 minutes, and I wait for response in Jmeter for 4 hours and nothing Jmeter just waiting.
in the console log I set to trace level and nothing, absolutely no idea how to start handle this issue in Jmeter.
an anyone please assists how to make Jmeter to write what happened?
or just to know if he connect or anything
since this behavior all the test can not be performed
Most probably you are as usual misconfiguring the SSH Command sampler.
The idea is not to run the script per se, you need to delegate the script execution to the Unix Shell, for example Bash this way you will be able to combine several commands together, see the output, amend debugging level, etc.
So I would recommend setting your command to something like /bin/bash -c -x /your/script.sh
Another guess, given you use sudo it might be the case that the sudo command simply waits for the password (which JMeter never provides), if this is the case try amending your script permissions using chmod command and allowing your user its execution without root privileges.
And finally, given you're able to run your command using "plik ssh" (whatever it is) you can run it using OS Process Sampler
More information: How to Run External Commands and Programs Locally and Remotely from JMeter

Running a crontab job from locally stored script

Having trouble running a crontab psql backup job from a locally stored script. I added the job via crontab -e and when I used crontab -l, it shows up in the list of jobs. The script that it is supposed to run works fine, checked that, runs as it should and dumps the output on the designated s3 bucket when using ./backup.sh
This is what I set the job as:
59 23 * * 7 /Users/myusername/backup.sh
The job should run at 11:59PM every Sunday, but it doesn't. I can't figure out what the issue is (do I need to leave line breaks/spaces in between each job, or just after the very lost job in my crontab list?
Any help would be very much appreciated. Thanks.
Depending on your distribution, you might want to check logs for Cron service.
Non-exhaustive list of possible problem reasons:
Cron service is not running at all and hence is not starting any of the tasks;
Usually Cron passes your script a very limited set of environment variables, so your script might fail because of some missing environment. That will probably be reflected in cron daemon logs
What can you do
Cron service: if your distro uses systemd then try running systemctl status cron (or systemctl status crond?) to check if it is running.
Your script is started but fails: here are several things to try.
Try checking cron service logs, maybe with something like journalctl --unit cron or journalctl -f before the script should be started;
Check if there is a dead.letter file in your home directory containing output of the failed script. When Cron starts your script and the script outputs something (which is considered a problem), that output is mailed to you. If mailing is not properly configured then it usually goes to that file.
Put something like this in the beginning of your script:
(
date
id -a
set
echo
) >> /tmp/myscript.log
Then wait until cron runs your script and check if the file /tmp/myscript.log was created. Then try to run your script manually, replicating all the environment created by cron which you now know. I.e. unset all but the variables Cron leaves, and make sure id is correct.

Jenkins SSH remote process is getting killed as soon as the Jenkins SSH plugin returns back

Jenkins version: 1.574
I created a simple job which performs the following:
Using "Execute shell script on remote host using SSH" as one of the BUILD steps, I'm just calling a shell script. This shell script performs stop and start operations on Tomcat to restart an application on the target machine.
I have a valid username, password, port defined for the target SSH server in Jenkins Global settings.
I saw this behavior that when I run a Jenkins job and call the restart script (which gets the application name as parameter $1), it works fine, but as soon as "Execute shell script on remote host using SSH" step completes, I see the new process dies on the remote/target application server.
If I run the script from the target/remote server itself, everything works fine and the new process/PID remains live forever, but running the same script from Jenkins, though I don't see any errors and everything works as expected, the new process dies as soon as the above mentioned SSH step is complete and control comes back to the next BUILD step in Jenkins job OR the Jenkins job is complete.
I saw a few posts/blogs and tried setting: BUILD_ID=dontKillMe in the Jenkins job (in various places i.e. Prepare Environment variables and also using Inject Environment variables...). When the job's particular build# is complete, I can see Environment Variables for that build# does say BUILD_ID=dontKillMe as its value (instead of the default Timestamp tag value).
I tried putting nohup before calling the restart script, i.e.,
nohup restart_tomcat.sh "${app}"
I also tried:
BUILD_ID=dontKillMe nohup restart_tomcat.sh "${app}"
This doesn't give any error and creates a nohup.out file on the remote server (but I'm not worried about it as the restart_tomcat.sh script itself creates its own LOG file which I'm "cat"ing after the restart_tomcat.sh script is complete. cat'ing on the log file is performed using another "Execute shell script on remote host using SSH" build step, and it successfully shows the log file created by the restart script).
I don't know what I'm missing at this point, but as soon as the restart_tomcat.sh step is complete, the new PID/process on the remote/target server dies.
How can I fix this?
I've been through this myself.
On my first iteration, before I knew about Jenkins ProcessTreeKiller, I ended up just daemonizing Tomcat. The Apache Tomcat documentation includes a section on running as a daemon.
You can also try disabling the ProcessTreeKiller for your whole Jenkins instance, if it's relatively small (read the first link for information).
The BUILD_ID=dontKillMe should be passed to the shell, and therefore it should be in your command line, not in Jenkins global configuration or job parameters.
BUILD_ID=dontKillMe restart_tomcat.sh "${app}" should have worked without problems.
You can also try nohup restart_tomcat.sh "${app}" & with the & at the end.
My solution (it worked after trying everything else) in Ubuntu 14.04 (Trusty Tahr) (Amazon AWS - Amazon EC2), Jenkins 1.601:
Exec command: (setsid COMMAND < /dev/null > /dev/null 2>&1 &);
Exec in PTY: DISABLED
// Example COMMAND=socat TCP4-LISTEN:1337,fork TCP4:127.0.0.1:1338
I created this Transfer as my last one.
#!/bin/ksh
export BUILD_ID=dontKillMe
I added the above line to the start of my script and the issue was resolved.