Batch job submission failed: I/O error writing script/environment to file - jobs

I installed Slurm on a workstation and it seemed to work: I can use the Slurm commands, and srun works too.
But when I try to launch a job from a script using sbatch test.sh, I get the following error: Batch job submission failed: I/O error writing script/environment to file. This happens even with the simplest script, such as
#!/bin/bash
srun hostname

Make sure slurmd is running as root. See the SlurmdUser parameter in slurm.conf. Its default value is root, and it should stay that way.
Note that this is different from the SlurmUser parameter, which defines the user that runs the controller processes; that one is preferably not root.
If the configuration is correct, then you might have a faulty filesystem at the location referred to in the SlurmdSpoolDir parameter, where slurmd writes the submission script and environment for jobs assigned to the node.
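A rough way to check both points from a shell, as a sketch only: conf_value and check_spool_dir are hypothetical helper names, the slurm.conf path is an assumption, and the parser assumes one KEY=VALUE pair per line. Run the write check as the user slurmd runs as.

```shell
#!/bin/bash
# Hypothetical helper: print the value of KEY from a slurm.conf-style file.
# Prints nothing when the key is absent (Slurm then uses its default).
conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            k = $1; gsub(/[ \t]/, "", k)
            if (tolower(k) == tolower(key)) {
                v = $2
                sub(/[ \t]*#.*$/, "", v)           # drop a trailing comment
                gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim whitespace
                print v
                exit
            }
        }' "$2"
}

# Hypothetical helper: try to actually create a file in the directory,
# which also catches a read-only or broken filesystem.
check_spool_dir() {
    probe=$(mktemp "$1/.write-test.XXXXXX" 2>/dev/null) || {
        echo "cannot write to $1"
        return 1
    }
    rm -f "$probe"
    echo "write ok: $1"
}

# Usage (the path is an assumption; adjust to your install):
#   conf=/etc/slurm/slurm.conf
#   conf_value SlurmdUser "$conf"     # empty output means the default applies
#   check_spool_dir "$(conf_value SlurmdSpoolDir "$conf")"
```

If Slurm is already running, scontrol show config should also print the effective SlurmdUser and SlurmdSpoolDir values, which avoids parsing the file by hand.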

Related

Keep Snakemake jobscript after a failed cluster execution?

I have a workflow that runs correctly on my own machine, but fails when submitted to a cluster. The error seems to be a shell problem in the generated jobscript, at the point where the environment variables are set:
/nfs/mypath/test/.snakemake/tmp.m80omc9q/snakejob.retrieve_data.3.sh:
line 3: NEW_ACCESSION='E-CORN-1' ACCESSIONS='E-MTAB-4395,E-MTAB-4342,E-MTAB-4128,E-MTAB-3826,E-MTAB-3173,E-MTAB-964,E-GEOD-62778,E-GEOD-54272'
COVARIATE_TYPE='characteristic'
BATCH='study' COVARIATE='organism part'
CACHE_PATH='/other-path/baseline-merge-cache': No such file or directory
However, because the jobscript gets deleted, I cannot inspect the shell script to understand the issue. Is there any way to keep this file for inspection after snakemake finishes or fails?

LLVM profiling on child process

I want to extract execution traces (e.g., visited basic blocks) when testing the Apache server (httpd). Since my work is based on the LLVM infrastructure, I chose clang's instrumentation-based profiling, as follows:
clang -fprofile-instr-generate ${options to compile httpd} -o httpd
export LLVM_PROFILE_FILE=code-%p.profraw
sudo -E ./httpd -k start # output a .profraw
curl ${url} # send a request
sudo -E ./httpd -k stop # output another .profraw
The compilation of instrumented httpd works well.
However, I want to track httpd's request handling, which is executed in a separate child process. The output .profraw does not record any execution from child processes. As a result, I can only access the execution traces of starting and stopping the server. How can I get a .profraw that includes request handling?
The solution is not restricted to clang profiling. Anything compatible with LLVM would be great. Thanks!
Update
From the logs, it turns out that the child process owned by "daemon" has no write permission to the profile file:
LLVM Profile Error: Failed to write file "code-94752.profraw": Permission denied
Problem solved
The problem is a collision of profile file names. The process started by httpd -k start creates multiple child processes as workers. With LLVM_PROFILE_FILE=code-%p.profraw, the %p pattern expands to the same PID for all of them. The main process, owned by root, creates the profile file first, and the later processes, owned by daemon, cannot write to that file.
Solution: Use LLVM_PROFILE_FILE=code-%9m.profraw (%Nm instead of %p) to avoid name collisions.
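Once the run leaves several .profraw files behind, they still need to be combined before inspection. A minimal sketch, assuming llvm-profdata is on the PATH; merge_profiles is a hypothetical helper name and code.profdata is an arbitrary output name:

```shell
#!/bin/bash
# Hypothetical helper: merge every .profraw in a directory into one .profdata.
# Returns 1 without calling llvm-profdata if no raw profiles are present.
merge_profiles() {
    local dir=$1 out=$2
    set -- "$dir"/*.profraw           # positional parameters now hold the matches
    if [ ! -e "$1" ]; then            # unmatched glob stays literal and does not exist
        echo "no .profraw files in $dir" >&2
        return 1
    fi
    llvm-profdata merge -o "$out" "$@"
}

# Usage, after stopping httpd:
#   merge_profiles . code.profdata
```

The merged file can then be passed to the usual reporting tools, for example llvm-cov show ./httpd -instr-profile=code.profdata.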

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm a new Flink user and I have the following problem.
I use Flink on a YARN cluster to transfer related data extracted from an RDBMS to HBase.
I wrote a Flink batch application in Java with multiple ExecutionEnvironments (one per RDB table, so that table rows are transferred in parallel). The tables themselves are transferred sequentially, one after another, because the call to env.execute() is blocking.
I start the YARN session like this:
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on the started YARN session via the shell script transfer.sh. Its content is:
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script manually from the command line, it works fine: jobs are submitted to the YARN session one by one without errors.
Now I need to run this script from another Java program.
For this I use
Runtime.getRuntime().exec("transfer.sh");
(Maybe there are better ways to do this? I have looked at the REST API, but there are some difficulties because the job manager is proxied by YARN.)
At first it works as usual: the first several jobs are submitted to the session and finish successfully. But the subsequent jobs are not submitted to the YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see this error (no other errors appear at DEBUG level):
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I tried to analyse this problem myself and found that the error is raised in the JobClient class while it sends a ping request with a timeout to the JobClientActor (i.e. the YARN cluster).
I tried increasing several heartbeat and timeout options, such as akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay, but that doesn't solve the problem: new jobs are still not submitted to the YARN session from CliFrontend.
The environment is the same in both cases (manual call and call from another program). When I call
$ ps axu | grep transfer
it gives me this output:
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried updating Flink to the 1.4.0 release and changing the parallelism of the job (even to -p 1), but the error still occurs.
I have no idea what could be different. Is there any workaround?
Thank you for any help.
I finally found out how to resolve that error.
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I really don't know why calling inheritIO helps in this case because, as I understand it, it just redirects the child process's IO streams to the parent process.
But I have checked that if I comment out this line of code, the program starts failing again.
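One plausible explanation, not confirmed in this thread: Runtime.exec connects the child's stdout and stderr to pipes that the parent has to read, and if nobody drains them the child can block once a pipe buffer fills up. inheritIO avoids those pipes entirely. If that is the cause here, sending the output to a file inside the script should have a similar effect. A minimal sketch, where run_logged is a hypothetical helper and the log path is an assumption:

```shell
#!/bin/bash
# Hypothetical helper: run a command with stdin detached and with stdout and
# stderr appended to a log file, so the command never writes into a pipe
# that the calling program forgot to read.
run_logged() {
    local log=$1
    shift
    "$@" < /dev/null >> "$log" 2>&1
}

# Usage inside transfer.sh (the log path is an assumption):
#   run_logged /tmp/transfer.log "$FLINK_HOME/bin/flink" run -p 4 transfer.jar
```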

Running a crontab job from locally stored script

I'm having trouble running a crontab psql backup job from a locally stored script. I added the job via crontab -e, and it shows up in the list of jobs when I run crontab -l. The script it is supposed to run works fine: I checked that it runs as it should and dumps the output to the designated S3 bucket when started with ./backup.sh
This is what I set the job as:
59 23 * * 7 /Users/myusername/backup.sh
The job should run at 11:59 PM every Sunday, but it doesn't. I can't figure out what the issue is. Do I need to leave line breaks or spaces between jobs, or just after the very last job in my crontab list?
Any help would be very much appreciated. Thanks.
Depending on your distribution, you might want to check the logs of the cron service.
Non-exhaustive list of possible causes:
The cron service is not running at all and hence is not starting any of the tasks.
Cron usually passes your script a very limited set of environment variables, so your script might fail because something is missing from its environment. That will probably be reflected in the cron daemon logs.
What can you do
Cron service: if your distro uses systemd then try running systemctl status cron (or systemctl status crond?) to check if it is running.
Your script is started but fails: here are several things to try.
Try checking cron service logs, maybe with something like journalctl --unit cron or journalctl -f before the script should be started;
Check if there is a dead.letter file in your home directory containing output of the failed script. When Cron starts your script and the script outputs something (which is considered a problem), that output is mailed to you. If mailing is not properly configured then it usually goes to that file.
Put something like this in the beginning of your script:
(
date
id -a
set
echo
) >> /tmp/myscript.log
Then wait until cron runs your script and check whether the file /tmp/myscript.log was created. Next, try to run your script manually while replicating the environment cron provides, which you now know from the log: unset all variables except those cron sets, and make sure the user id matches.
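Replicating that stripped-down environment by hand can be sketched with env -i, which starts from an empty environment. The values below (PATH=/usr/bin:/bin, SHELL=/bin/sh) are assumptions typical of a default cron setup; replace them with whatever /tmp/myscript.log actually shows. run_like_cron is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: run a command line under a minimal, cron-like
# environment. env -i clears every inherited variable, so only the ones
# listed here are visible to the command.
run_like_cron() {
    env -i \
        HOME="$HOME" \
        LOGNAME="$(id -un)" \
        SHELL=/bin/sh \
        PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# Usage:
#   run_like_cron '/Users/myusername/backup.sh'
```

If the script fails here but works from your normal shell, the difference is most likely a missing variable or a command that is not on the reduced PATH.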

Jenkins SSH remote process is getting killed as soon as the Jenkins SSH plugin returns back

Jenkins version: 1.574
I created a simple job which performs the following:
Using "Execute shell script on remote host using SSH" as one of the BUILD steps, I'm just calling a shell script. This shell script performs stop and start operations on Tomcat to restart an application on the target machine.
I have a valid username, password, port defined for the target SSH server in Jenkins Global settings.
What I observe is this: when I run a Jenkins job that calls the restart script (which takes the application name as parameter $1), it works fine, but as soon as the "Execute shell script on remote host using SSH" step completes, the new process dies on the remote/target application server.
If I run the script from the target/remote server itself, everything works fine and the new process/PID remains live forever, but running the same script from Jenkins, though I don't see any errors and everything works as expected, the new process dies as soon as the above mentioned SSH step is complete and control comes back to the next BUILD step in Jenkins job OR the Jenkins job is complete.
I saw a few posts/blogs and tried setting BUILD_ID=dontKillMe in the Jenkins job (in various places, i.e. Prepare Environment variables and also Inject Environment variables...). When a particular build of the job completes, the Environment Variables page for that build does show BUILD_ID=dontKillMe as its value (instead of the default timestamp value).
I tried putting nohup before calling the restart script, i.e.,
nohup restart_tomcat.sh "${app}"
I also tried:
BUILD_ID=dontKillMe nohup restart_tomcat.sh "${app}"
This doesn't give any error and creates a nohup.out file on the remote server (but I'm not worried about it as the restart_tomcat.sh script itself creates its own LOG file which I'm "cat"ing after the restart_tomcat.sh script is complete. cat'ing on the log file is performed using another "Execute shell script on remote host using SSH" build step, and it successfully shows the log file created by the restart script).
I don't know what I'm missing at this point, but as soon as the restart_tomcat.sh step is complete, the new PID/process on the remote/target server dies.
How can I fix this?
I've been through this myself.
On my first iteration, before I knew about Jenkins ProcessTreeKiller, I ended up just daemonizing Tomcat. The Apache Tomcat documentation includes a section on running as a daemon.
You can also try disabling the ProcessTreeKiller for your whole Jenkins instance, if it's relatively small (read the first link for information).
The BUILD_ID=dontKillMe should be passed to the shell, and therefore it should be in your command line, not in Jenkins global configuration or job parameters.
BUILD_ID=dontKillMe restart_tomcat.sh "${app}" should have worked without problems.
You can also try nohup restart_tomcat.sh "${app}" & with the & at the end.
My solution (it worked after trying everything else) in Ubuntu 14.04 (Trusty Tahr) (Amazon AWS - Amazon EC2), Jenkins 1.601:
Exec command: (setsid COMMAND < /dev/null > /dev/null 2>&1 &);
Exec in PTY: DISABLED
// Example COMMAND=socat TCP4-LISTEN:1337,fork TCP4:127.0.0.1:1338
I created this Transfer as my last one.
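The setsid approach above can be wrapped in a small function, as a sketch. setsid starts the command in a new session, so it is no longer tied to the SSH session that launched it, and redirecting all three standard streams keeps it from holding the SSH channel open. start_detached is a hypothetical helper name, and the nohup fallback for systems without setsid is my own addition, not part of the answer.

```shell
#!/bin/bash
# Hypothetical helper: start a command fully detached from the calling
# session. Uses setsid when available, otherwise falls back to nohup.
start_detached() {
    if command -v setsid >/dev/null 2>&1; then
        setsid "$@" < /dev/null > /dev/null 2>&1 &
    else
        nohup "$@" < /dev/null > /dev/null 2>&1 &
    fi
}

# Usage, with the script from the question:
#   start_detached restart_tomcat.sh "${app}"
```

Because the output goes to /dev/null, the started command needs its own log file, as restart_tomcat.sh in the question already has.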
#!/bin/ksh
export BUILD_ID=dontKillMe
I added the above line to the start of my script and the issue was resolved.