Multiple server builds with the same image - one will not run a specific crontab line - scripting

We have a watchdog script, deployed on a number of servers, that restarts certain services if they fail. It runs successfully at a number of sites, but one site will not run the crontab entry that triggers the watchdog. If we run the entry from the command line, it works fine.
When the watchdog is installed, it adds the following line to crontab; you just remove the '#' to enable it:
#*/5 * * * * root /usr/local/fusion/scripts/watch_fusion_services 60
Other entries in crontab do run - it's just this one line.
I have done the following to try and resolve the issue:
Removed the crontab entries for the watchdog and reinstalled the watchdog
Checked syslog and found this error:
Error: bad hour; while reading /etc/crontab
Changed the crontab line to run every 6 minutes instead of 5 (since another cron job at this site was also running every 5 minutes)
The syslog error no longer occurs, but the watchdog still does not run via crontab, and there are no error messages in syslog
Tested running the crontab line from the command prompt - this works okay
Attempted same process on a test VM - worked okay
Attempted same process in live environment - tested okay
Checked the versions of the ICA - both the same: GNU/Linux 3.13.0-117-generic x86_64
Ran ntpq -p on the server that is having issues - the time source is 'LOCAL'
Typed the entry in by hand - the same issue occurs
I could try rebooting the server, but that seems a bit extreme for one crontab entry not working.
Does anyone have any ideas about this one?

There was a line of junk in crontab that we had all missed. One of the other technicians I work with pointed it out as he walked past our department.
Sometimes you just can't see the wood for the trees when you have been looking at a problem for too long.
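For anyone who lands here with a similar 'bad hour' parse error: a quick way to spot a stray or malformed line is to print everything in /etc/crontab that is neither a comment nor blank and eyeball the field count. This is just a generic check, not the exact command we used:
grep -nEv '^[[:space:]]*(#|$)' /etc/crontab   # -n adds line numbers so the offending line is easy to find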

ConEmu stopped working with Bash on Windows

I have been using ConEmu for some time to run 'Bash for Windows', but all of a sudden it stopped working. When I launch a bash tab, it now says:
ConEmuC: Root process was alive less than 10 sec, ExitCode=0.
Press Enter or Esc to close console...
I think this means that it launched something, but the shell closed quickly. I haven't changed anything with the system that I can recall; it just started doing this.
Windows version: 10 (build 15063)
ConEmu version: 180626 preview
No windows updates were done recently.
If I launch 'Bash on Ubuntu on Windows' from the Start menu, it works fine (but I prefer using ConEmu as a terminal). Also, I can launch a cmd tab in ConEmu, then type bash, and the bash shell will start. Unfortunately, launching bash this way results in a lack of mouse support and strange arrow-key behavior.
I am at a bit of a loss. It worked perfectly previously... then it just stopped. Any help on how to debug or fix this would be appreciated.
If anybody finds this with a similar problem, my solution was to make a new task in ConEmu, with the command:
%windir%\system32\bash -l -i -cur_console:p5
The key is the -cur_console:p5, which sets the pty type as:
p[N] - pty modes, N - bitmask: 1 - XTermKeys, 2 - BrPaste, 4 - AppCursorKeys; default is 5 (1+4)
For some reason, setting p5 is necessary for me, despite it being listed as the default. Still not sure what changed, but at least it is working now.
I have the same problem, but (for what it's worth) I found that when it happens (it varies between intermittent and constant), restarting my machine fixes the issue every time.

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm a new Flink user and I have the following problem.
I use Flink on a YARN cluster to transfer data extracted from an RDBMS to HBase.
I wrote a Flink batch application in Java with multiple ExecutionEnvironments (one per RDB table, so that the rows of each table are transferred in parallel) and transfer the tables one by one sequentially (because the call to env.execute() is blocking).
I start the YARN session like this:
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on the started YARN session via a shell script, transfer.sh. Its content is:
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script from the command line manually, it works fine - the jobs are submitted to the YARN session one by one without errors.
Now I need to be able to run this script from another Java program.
For this I use
Runtime.exec("transfer.sh");
(Maybe there are better ways to do this? I have looked at the REST API, but there are some difficulties because the job manager is proxied by YARN.)
At the beginning it works as usual - the first several jobs are submitted to the session and finish successfully. But the following jobs are not submitted to the YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see this error (and no other errors, even at DEBUG level):
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I have tried to analyse this problem myself and found that the error occurs in the JobClient class while it is sending a ping request with a timeout to the JobClientActor (i.e. the YARN cluster).
I tried increasing several heartbeat and timeout options, such as akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay, but that did not solve the problem - new jobs are still not submitted to the YARN session from CliFrontend.
The environment in both cases (manual call and call from another program) is the same. When I run
$ ps axu | grep transfer
it gives me this output:
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried updating Flink to the 1.4.0 release and changing the parallelism of the job (even to -p 1), but the error still occurred.
I have no idea what could be different. Is there any workaround?
Thank you for any help.
Finally I found out how to resolve this error.
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I don't really know why the call to inheritIO helps in this case, because as I understand it, it just redirects the IO streams of the child process to the parent process. (My best guess: with plain Runtime.exec nothing reads the child's stdout/stderr, so the child eventually blocks once the pipe buffers fill up; inheritIO avoids that.)
But I have checked that if I comment this line out, the program starts to fail again.

Running a crontab job from locally stored script

I'm having trouble running a crontab psql backup job from a locally stored script. I added the job via crontab -e, and when I use crontab -l it shows up in the list of jobs. The script that it is supposed to run works fine - I checked that; it runs as it should and dumps the output to the designated S3 bucket when run as ./backup.sh.
This is what I set the job as:
59 23 * * 7 /Users/myusername/backup.sh
The job should run at 11:59 PM every Sunday, but it doesn't. I can't figure out what the issue is (do I need to leave line breaks/spaces between each job, or just after the very last job in my crontab list?).
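(For reference, here is how I am reading the five time fields - day-of-week 7 should be treated as Sunday, the same as 0, on most cron implementations:)
# minute  hour  day-of-month  month  day-of-week  command
  59      23    *             *      7            /Users/myusername/backup.sh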
Any help would be very much appreciated. Thanks.
Depending on your distribution, you might want to check the logs for the cron service.
Non-exhaustive list of possible problem reasons:
Cron service is not running at all and hence is not starting any of the tasks;
Usually cron passes your script a very limited set of environment variables, so your script might fail because of some missing environment. That will probably be reflected in the cron daemon logs.
What can you do
Cron service: if your distro uses systemd then try running systemctl status cron (or systemctl status crond?) to check if it is running.
Your script is started but fails: here are several things to try.
Try checking the cron service logs, maybe with something like journalctl --unit cron, or journalctl -f just before the script is due to start;
Check if there is a dead.letter file in your home directory containing output of the failed script. When Cron starts your script and the script outputs something (which is considered a problem), that output is mailed to you. If mailing is not properly configured then it usually goes to that file.
Put something like this at the beginning of your script:
(
date     # when the script was run
id -a    # which user and groups cron ran it as
set      # the (very limited) environment and shell variables cron provides
echo
) >> /tmp/myscript.log
Then wait until cron runs your script and check whether the file /tmp/myscript.log was created. After that, try running your script manually while replicating the environment created by cron, which you now know - i.e. unset all but the variables cron leaves, and make sure the id is correct.
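For example, something along these lines approximates cron's stripped-down environment - the exact SHELL and PATH values should come from the /tmp/myscript.log captured above; the ones here are just typical defaults:
env -i HOME="$HOME" LOGNAME="$LOGNAME" SHELL=/bin/sh PATH=/usr/bin:/bin /Users/myusername/backup.sh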

How to run forever without sudo on Ec2

The title pretty much tells what the question is about. I am trying to use forever to start a script on EC2, but it does not work unless I use sudo.
If I start it without sudo, I get:
warn: --minUptime not set. Defaulting to: 1000ms
warn: --spinSleepTime not set. Your script will exit if it does not stay up for at least 1000ms
info: Forever processing file: ci.js
But when I then do forever list, I get:
info: No forever processes running
You should run forever list as the same user you started forever with (it seems like you are already doing that).
Try checking ps aux | grep node after you do forever start. Maybe you haven't started any process at all (because of errors on the command line or in your Node.js file), so forever list returns an empty list.
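As a rough illustration of that check (ci.js is the file from the question; everything here runs as the same non-sudo user):
forever start ci.js    # start the script without sudo
ps aux | grep node     # did a node process for ci.js actually come up?
forever list           # should now show the process for this user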
P.S. I've checked forever on my machine and it behaves exactly as you describe - if I run it under my 'ubuntu' user, the list of running processes is empty even though the process is alive... Seems like a bug in forever.

Jenkins SSH remote process is getting killed as soon as the Jenkins SSH plugin returns back

Jenkins version: 1.574
I created a simple job which performs the following:
Using "Execute shell script on remote host using SSH" as one of the BUILD steps, I'm just calling a shell script. This shell script performs stop and start operations on Tomcat to restart an application on the target machine.
I have a valid username, password, port defined for the target SSH server in Jenkins Global settings.
I'm seeing this behavior: when I run the Jenkins job and call the restart script (which takes the application name as parameter $1), it works fine, but as soon as the "Execute shell script on remote host using SSH" step completes, the new process dies on the remote/target application server.
If I run the script from the target/remote server itself, everything works fine and the new process/PID stays alive. But when the same script is run from Jenkins, even though I don't see any errors and everything appears to work as expected, the new process dies as soon as the above-mentioned SSH step completes and control comes back to the next BUILD step in the Jenkins job, or the Jenkins job finishes.
I saw a few posts/blogs and tried setting BUILD_ID=dontKillMe in the Jenkins job (in various places, i.e. Prepare Environment variables and also using Inject Environment variables...). When that particular build is complete, I can see that the Environment Variables for the build do show BUILD_ID=dontKillMe as the value (instead of the default timestamp value).
I tried putting nohup before calling the restart script, i.e.,
nohup restart_tomcat.sh "${app}"
I also tried:
BUILD_ID=dontKillMe nohup restart_tomcat.sh "${app}"
This doesn't give any error and creates a nohup.out file on the remote server (but I'm not worried about that, as the restart_tomcat.sh script creates its own log file, which I cat after restart_tomcat.sh completes. The cat of the log file is done in another "Execute shell script on remote host using SSH" build step, and it successfully shows the log file created by the restart script).
I don't know what I'm missing at this point, but as soon as the restart_tomcat.sh step is complete, the new PID/process on the remote/target server dies.
How can I fix this?
I've been through this myself.
On my first iteration, before I knew about Jenkins ProcessTreeKiller, I ended up just daemonizing Tomcat. The Apache Tomcat documentation includes a section on running as a daemon.
You can also try disabling the ProcessTreeKiller for your whole Jenkins instance, if it's relatively small (read the first link for information).
The BUILD_ID=dontKillMe should be passed to the shell, and therefore it should be in your command line, not in Jenkins global configuration or job parameters.
BUILD_ID=dontKillMe restart_tomcat.sh "${app}" should have worked without problems.
You can also try nohup restart_tomcat.sh "${app}" & with the & at the end.
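Putting those two suggestions together, a remote command along these lines is worth trying - the output redirections are my own addition (not something spelled out above), to make sure no open pipe keeps the process tied to the SSH session:
BUILD_ID=dontKillMe nohup restart_tomcat.sh "${app}" > /dev/null 2>&1 &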
My solution (it worked after trying everything else) in Ubuntu 14.04 (Trusty Tahr) (Amazon AWS - Amazon EC2), Jenkins 1.601:
Exec command: (setsid COMMAND < /dev/null > /dev/null 2>&1 &);
Exec in PTY: DISABLED
// Example COMMAND=socat TCP4-LISTEN:1337,fork TCP4:127.0.0.1:1338
I created this Transfer as my last one.
#!/bin/ksh
export BUILD_ID=dontKillMe
I added the above line to the start of my script and the issue was resolved.