When running Snakemake on a cluster, jobs are scheduled fine via SLURM. Sometimes one job fails, which causes the Snakemake instance/run to stop once the still-running jobs have completed. To speed things up I stopped Snakemake (Ctrl+C) and restarted it. What I did not think of is that in this case some jobs from the previous run might still be running on the cluster. Hence it could happen that the same job is started again if no output has been written by then. In that case it could finally lead to the situation where two jobs write to the same output file. Or is that prevented by some other record Snakemake keeps to track successful completion?
I hope you can follow this explanation. I'm happy for any comments!
In this case it could finally lead to the situation where 2 jobs write to the same output file.
Snakemake should be aware that the previous execution didn't exit cleanly (because of the Ctrl+C) and that the jobs that were running at that moment are incomplete or absent. However, Snakemake cannot know that those pending jobs are still running as independent processes.
So yes, I think it can happen that jobs step on each other's feet in what you are doing.
In my opinion, before re-running Snakemake it would be safer to kill the pending jobs and start fresh. (Those that completed before Snakemake was killed are fine, of course.)
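Since the question mentions SLURM, something along these lines would do that housekeeping before the restart (standard SLURM commands; the job id is a placeholder):

    squeue -u $USER      # list what is still queued or running from the previous run
    scancel <jobid>      # cancel a specific leftover job
    scancel -u $USER     # or cancel all of your jobs in one go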
Note that there is an option in snakemake that may help you:
    --keep-going, -k    Go on with independent jobs if a job fails. (default: False)
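For example, a sketch of a cluster invocation with that flag (the sbatch arguments are placeholders, using the classic --cluster submission style):

    snakemake --jobs 100 \
              --cluster "sbatch --partition=normal --time=04:00:00" \
              --keep-going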
Related
We have a test that takes several hours to run and that we'd like to run on our codebase as often as possible in GitLab CI. The idea is for it to validate as many commits as possible by merging them from dev into main but we know it's too slow to run on every commit.
It could run on a schedule, e.g. 18:00 every evening, but then it would run unnecessarily if there have been no changes and it wouldn't run as often as it could, e.g. 2-3 times a day.
Limiting concurrent jobs as suggested here isn't enough because the jobs will pile up, one per commit, and there will never be time to run them all.
We'd like it to complete the test for one commit, and then restart on the latest commit available, skipping over any commits that came in earlier.
I've looked through the rules section of the docs but don't see any magic variables that would let me say "run this job if it's not already running". Perhaps some kind of semaphore as described here (requested but not implemented as far as I can see).
How can we tell GitLab CI to run this particular job only if it's not already running and skip the job otherwise?
My workflow often includes PBS job submissions to a shared cluster that need to wait in the scheduling queue, take over 24 hrs to run, or both. I'd like to run snakemake in the 'background' and get my prompt back while these jobs are running. I know this can be done using tmux, screen, or &, but is there a better way to do this?
I guess submitting a bash wrapper script with the snakemake commands inside is an option, but I think I'm lacking some understanding of the workflow.
tmux is the recommended way to execute a Snakemake workflow. It will give you all you need, regardless of whether you are on a cluster or on a compute server.
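For instance, a minimal sketch of that pattern (the session name and submission command are just examples, assuming a classic --cluster style call for PBS):

    tmux new -s smk                      # start a named session
    snakemake --cluster qsub --jobs 50   # run the workflow inside it
    # detach with Ctrl-b d; the workflow keeps running on the server
    tmux attach -t smk                   # reattach later to check progress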
It's been more than an hour and the job is still running; I guess it is dead already. What I was doing is very simple:
I have two very small text files, which I have already imported into HDFS, and I would like to practice some Pig Latin operations. Here is what I did:
1. I created two relations, one for each
2. I created a co-grouping
3. I tried to get a dump
The dump has now lasted for more than an hour. I checked a few times in the GUI and found that the same job has ended and started again:
1. completed 50%
2. started again and is hanging
By the way: what the heck is Dr. Who, showing in this screenshot (top right corner)?
In this case you may want to kill the job. The command is:
yarn application -kill application_xxxxxx
and refresh the queue after the job is killed:
yarn rmadmin -refreshQueues
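If you don't have the application id at hand, it can be looked up first with the standard YARN CLI:

    yarn application -list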
Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I am trying to find and restart the stale job executions with something like this:
    Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
    for (JobExecution jobExecution : jobExecutions) {
        // mark the stale execution as failed so it becomes restartable
        jobExecution.setStatus(BatchStatus.FAILED);
        jobExecution.setEndTime(new Date());
        jobRepository.update(jobExecution);
        // and relaunch it
        jobOperator.restart(jobExecution.getId());
    }
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs can be started.
2) I have to handle multiple running server instances, so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution that registers a "start up clean jobs" listener. This would still not fix the problems originating from the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is actually running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain, and the job will still look STARTED.
"OK, fine," you say, "but if I start another execution, I want it to find that STARTED execution and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't update the database. The framework correctly errs on the side of caution here and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the following message: "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs (the second one is sketched below):
Stop via throw Exception
Stop via JobOperator.stop()
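For reference, a minimal sketch of the JobOperator.stop() pattern, assuming a JobOperator bean is available (the wrapper class is made up for illustration):

    import org.springframework.batch.core.launch.JobOperator;

    public class JobStopper {

        private final JobOperator jobOperator;

        public JobStopper(JobOperator jobOperator) {
            this.jobOperator = jobOperator;
        }

        // Sends a stop signal: the execution is moved to STOPPING and the job
        // stops gracefully at the next chunk boundary, leaving it restartable.
        public void stop(long executionId) throws Exception {
            jobOperator.stop(executionId);
        }
    }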
Is there a plugin for this, or can I configure Jenkins somehow, so that a job (which is triggered by 3 other jobs) queues up until a specified time and only then executes the whole queue?
Our case is this:
we have tests run for 3 branches:
1. each of the 3 build jobs for those branches triggers the same smoke-test-job, which runs immediately
2. each of the 3 build jobs for those branches triggers the same complete-test-job
Points 1 and 2 work perfectly fine.
The complete-test-job should queue the tests all day long and just execute them in the evening or at night (starting from a defined time like 6 pm), so that the tests are run at night and during the day the job is silent.
Triggering the complete-test-job at a specified time with the newest version is not an option. We absolutely need the trigger from the upstream build job (because of the promotion plugin, and because we do not want to re-run versions that have already been run).
That seems a rather strange request. Why queue a build if you don't want it now... And if you want a build later, then you shouldn't be triggering it now.
You can use Jenkins Exclusion plugin. Have your test jobs use a certain resource. Make another job whose task is to "hold" the resource during the day. While the resource is in use, the test jobs won't run.
Problem with this: you are going to kill your executors by having queued non-executing jobs, and there won't be free executors for other jobs.
Haven't tried it myself, but this sounds like a solution to your problem.
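To make the "hold the resource during the day" idea a bit more concrete: the holder job could be triggered in the morning and simply sleep in a shell build step until the evening, something like this (purely a sketch; assumes GNU date and an 18:00 release time):

    # hypothetical shell build step of the "holder" job: keep the shared resource busy until 18:00
    now=$(date +%s)
    end=$(date -d 18:00 +%s)
    if [ "$end" -gt "$now" ]; then
        sleep $(( end - now ))
    fi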